* [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Introduce a new API for dma-buf importers, and add a dma_buf_ops
callback for dma-buf exporters. This API is for subsystem importers that
map the dma-buf to some user-defined address space, e.g. for IOMMUFD to
map the dma-buf to userspace IOVA via the IOMMU page table, or for KVM
to map the dma-buf to GPA via the KVM MMU (e.g. EPT).
Currently dma-buf is only used to get DMA addresses for a device's
default domain by using the kernel DMA APIs. But for these new
use cases, importers only need the pfn of the dma-buf resource to build
their own mapping tables. So the map_dma_buf() callback is no longer
mandatory for exporters. Importers may also choose not to provide a
struct device *dev on dma_buf_attach() if they don't call
dma_buf_map_attachment().
Like dma_buf_map_attachment(), the importer should first call
dma_buf_attach()/dma_buf_dynamic_attach() and then call
dma_buf_get_pfn_unlocked(). If the importer chooses dynamic attach, it
should also handle the dma-buf move notification.
Only the unlocked version of dma_buf_get_pfn() is implemented for now,
simply because no locked version is needed yet.
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
IIUC, only get_pfn() is needed and no put_pfn(). The whole dma-buf is
referenced/unreferenced at dma-buf attach/detach time.
Specifically, for static attachment, the exporter should always make the
memory resource available/pinned on the first dma_buf_attach(), and
release/unpin it on the last dma_buf_detach(). For dynamic attachment,
the exporter may populate & invalidate the memory resource at any time;
that's OK as long as the importers follow the dma-buf move
notification. So no pinning is needed for get_pfn() and no put_pfn() is
needed.
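For illustration, a minimal sketch of the intended importer flow
(hypothetical importer: my_import(), my_move_notify() and an exporter
that implements only get_pfn() are assumptions, error handling trimmed):

static void my_move_notify(struct dma_buf_attachment *attach)
{
        /* Tear down the importer-side mapping; re-fault via get_pfn(). */
}

static const struct dma_buf_attach_ops my_importer_ops = {
        .allow_peer2peer = true,
        .move_notify = my_move_notify,  /* mandatory for dynamic attach */
};

static int my_import(struct dma_buf *dmabuf, pgoff_t pgoff)
{
        struct dma_buf_attachment *attach;
        u64 pfn;
        int max_order, ret;

        /* dev may be NULL since dma_buf_map_attachment() is never called */
        attach = dma_buf_dynamic_attach(dmabuf, NULL, &my_importer_ops, NULL);
        if (IS_ERR(attach))
                return PTR_ERR(attach);

        ret = dma_buf_get_pfn_unlocked(attach, pgoff, &pfn, &max_order);
        if (ret) {
                dma_buf_detach(dmabuf, attach);
                return ret;
        }

        /*
         * Build the importer's own mapping from pfn here, e.g. a KVM MMU
         * or IOMMU page table entry, honoring max_order. Keep the
         * attachment until the mapping is torn down, then dma_buf_detach().
         */
        return 0;
}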
---
drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
include/linux/dma-buf.h | 13 ++++++
2 files changed, 86 insertions(+), 17 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7eeee3a38202..83d1448b6dcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
size_t alloc_size = sizeof(struct dma_buf);
int ret;
- if (WARN_ON(!exp_info->priv || !exp_info->ops
- || !exp_info->ops->map_dma_buf
- || !exp_info->ops->unmap_dma_buf
- || !exp_info->ops->release))
+ if (WARN_ON(!exp_info->priv || !exp_info->ops ||
+ (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
+ (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
+ !exp_info->ops->release))
return ERR_PTR(-EINVAL);
if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
@@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
struct dma_buf_attachment *attach;
int ret;
- if (WARN_ON(!dmabuf || !dev))
+ if (WARN_ON(!dmabuf))
+ return ERR_PTR(-EINVAL);
+
+ if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
return ERR_PTR(-EINVAL);
if (WARN_ON(importer_ops && !importer_ops->move_notify))
@@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
*/
if (dma_buf_attachment_is_dynamic(attach) !=
dma_buf_is_dynamic(dmabuf)) {
- struct sg_table *sgt;
+ struct sg_table *sgt = NULL;
dma_resv_lock(attach->dmabuf->resv, NULL);
if (dma_buf_is_dynamic(attach->dmabuf)) {
@@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
goto err_unlock;
}
- sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
- if (!sgt)
- sgt = ERR_PTR(-ENOMEM);
- if (IS_ERR(sgt)) {
- ret = PTR_ERR(sgt);
- goto err_unpin;
+ if (dmabuf->ops->map_dma_buf) {
+ sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
+ if (!sgt)
+ sgt = ERR_PTR(-ENOMEM);
+ if (IS_ERR(sgt)) {
+ ret = PTR_ERR(sgt);
+ goto err_unpin;
+ }
}
+
dma_resv_unlock(attach->dmabuf->resv);
attach->sgt = sgt;
attach->dir = DMA_BIDIRECTIONAL;
@@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf))
+ if (WARN_ON(!attach || !attach->dmabuf ||
+ !attach->dmabuf->ops->map_dma_buf))
return ERR_PTR(-EINVAL);
dma_resv_assert_held(attach->dmabuf->resv);
@@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf))
+ if (WARN_ON(!attach || !attach->dmabuf ||
+ !attach->dmabuf->ops->map_dma_buf))
return ERR_PTR(-EINVAL);
dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
{
might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+ if (WARN_ON(!attach || !attach->dmabuf ||
+ !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
return;
dma_resv_assert_held(attach->dmabuf->resv);
@@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
{
might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+ if (WARN_ON(!attach || !attach->dmabuf ||
+ !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
return;
dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
}
EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
+/**
+ * dma_buf_get_pfn_unlocked - get the pfn of a page in the buffer
+ * @attach: [in] attachment to get pfn from
+ * @pgoff: [in] page offset of the buffer against the start of dma_buf
+ * @pfn: [out] returns the pfn of the buffer
+ * @max_order: [out] returns the max mapping order of the buffer
+ */
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+ pgoff_t pgoff, u64 *pfn, int *max_order)
+{
+ struct dma_buf *dmabuf;
+ int ret;
+
+ if (WARN_ON(!attach || !attach->dmabuf ||
+ !attach->dmabuf->ops->get_pfn))
+ return -EINVAL;
+ dmabuf = attach->dmabuf;
+ /*
+ * Open:
+ *
+ * When dma_buf is dynamic but dma_buf move is disabled, the buffer
+ * should be pinned before use, See dma_buf_map_attachment() for
+ * reference.
+ *
+ * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
+ * we'd need another API to unpin the dma_buf. So just fail this case.
+ */
+ if (dma_buf_is_dynamic(attach->dmabuf) &&
+ !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
+ return -ENOENT;
+
+ dma_resv_lock(attach->dmabuf->resv, NULL);
+ ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
+ /*
+ * Open:
+ *
+ * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
+ * content synchronization could be done when the buffer is to be
+ * mapped by importer.
+ */
+ dma_resv_unlock(attach->dmabuf->resv);
+
+ return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
+
/**
* dma_buf_move_notify - notify attachments that DMA-buf is moving
*
@@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
attach_count = 0;
list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
- seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
+ seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : "(null)");
attach_count++;
}
dma_resv_unlock(buf_obj->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bd..b16183edfb3a 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -194,6 +194,17 @@ struct dma_buf_ops {
* if the call would block.
*/
+ /**
+ * @get_pfn:
+ *
+ * This is called by dma_buf_get_pfn_unlocked(). It is used to get the
+ * pfn of the buffer at the given page offset from the start of the
+ * dma_buf. It can only be called if @attach has been called
+ * successfully.
+ */
+ int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
+ u64 *pfn, int *max_order);
+
/**
* @release:
*
@@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
struct sg_table *sg_table,
enum dma_data_direction direction);
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+ pgoff_t pgoff, u64 *pfn, int *max_order);
int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
unsigned long);
--
2.25.1
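For reference, a minimal sketch of the matching exporter side for a
contiguous MMIO-backed buffer (hypothetical exporter: struct my_buffer,
base_phys, nr_pages and my_release() are assumptions, not part of the
patch):

static int my_get_pfn(struct dma_buf_attachment *attach, pgoff_t pgoff,
                      u64 *pfn, int *max_order)
{
        /* Hypothetical driver-private state behind dmabuf->priv. */
        struct my_buffer *buf = attach->dmabuf->priv;

        if (pgoff >= buf->nr_pages)
                return -EINVAL;

        *pfn = PHYS_PFN(buf->base_phys) + pgoff;
        *max_order = 0; /* this sketch only offers PAGE_SIZE granules */
        return 0;
}

static const struct dma_buf_ops my_dmabuf_ops = {
        .get_pfn = my_get_pfn,
        .release = my_release,  /* hypothetical teardown */
};

With the patch above, dma_buf_export() accepts these ops because
get_pfn() substitutes for the now-optional map_dma_buf()/unmap_dma_buf()
pair.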
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Christian König @ 2025-01-08 8:01 UTC (permalink / raw)
To: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
On 07.01.25 at 15:27, Xu Yilun wrote:
> Introduce a new API for dma-buf importers, and add a dma_buf_ops
> callback for dma-buf exporters. This API is for subsystem importers that
> map the dma-buf to some user-defined address space, e.g. for IOMMUFD to
> map the dma-buf to userspace IOVA via the IOMMU page table, or for KVM
> to map the dma-buf to GPA via the KVM MMU (e.g. EPT).
>
> Currently dma-buf is only used to get DMA addresses for a device's
> default domain by using the kernel DMA APIs. But for these new
> use cases, importers only need the pfn of the dma-buf resource to build
> their own mapping tables.
As far as I can see I have to fundamentally reject this whole approach.
It's intentional DMA-buf design that we don't expose struct pages nor
PFNs to the importer. Essentially DMA-buf only transports DMA addresses.
In other words the mapping is done by the exporter and *not* the importer.
What we certainly can do is to annotate those DMA addresses to better
specify in which domain they are applicable, e.g. if they are PCIe bus
addresses or some inter device bus addresses etc...
But moving the functionality to map the pages/PFNs to DMA addresses into
the importer is an absolutely clear NO-GO.
Regards,
Christian.
> [SNIP]
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Jason Gunthorpe @ 2025-01-08 13:23 UTC (permalink / raw)
To: Christian König, Christoph Hellwig, Leon Romanovsky
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, aik, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
> On 07.01.25 at 15:27, Xu Yilun wrote:
> > Introduce a new API for dma-buf importers, and add a dma_buf_ops
> > callback for dma-buf exporters. This API is for subsystem importers that
> > map the dma-buf to some user-defined address space, e.g. for IOMMUFD to
> > map the dma-buf to userspace IOVA via the IOMMU page table, or for KVM
> > to map the dma-buf to GPA via the KVM MMU (e.g. EPT).
> >
> > Currently dma-buf is only used to get DMA addresses for a device's
> > default domain by using the kernel DMA APIs. But for these new
> > use cases, importers only need the pfn of the dma-buf resource to build
> > their own mapping tables.
>
> As far as I can see I have to fundamentally reject this whole approach.
>
> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
> to the importer. Essentially DMA-buf only transports DMA addresses.
>
> In other words the mapping is done by the exporter and *not* the importer.
>
> What we certainly can do is to annotate those DMA addresses to better
> specify in which domain they are applicable, e.g. if they are PCIe bus
> addresses or some inter device bus addresses etc...
>
> But moving the functionality to map the pages/PFNs to DMA addresses into the
> importer is an absolutely clear NO-GO.
Oh?
Having the importer do the mapping is the correct way to operate the
DMA API and the new API that Leon has built to fix the scatterlist
abuse in dmabuf relies on importer mapping as part of its
construction.
Why on earth do you want the exporter to map? That is completely
backwards and unworkable in many cases. The dysfunctional P2P support
in dmabuf is like that principally because of this.
That said, I don't think get_pfn() is an especially good interface,
but we will need to come up with something that passes the physical pfn
out.
Jason
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Christian König @ 2025-01-08 13:44 UTC (permalink / raw)
To: Jason Gunthorpe, Christoph Hellwig, Leon Romanovsky
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, aik, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
On 08.01.25 at 14:23, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
>> On 07.01.25 at 15:27, Xu Yilun wrote:
>>> Introduce a new API for dma-buf importers, and add a dma_buf_ops
>>> callback for dma-buf exporters. This API is for subsystem importers that
>>> map the dma-buf to some user-defined address space, e.g. for IOMMUFD to
>>> map the dma-buf to userspace IOVA via the IOMMU page table, or for KVM
>>> to map the dma-buf to GPA via the KVM MMU (e.g. EPT).
>>>
>>> Currently dma-buf is only used to get DMA addresses for a device's
>>> default domain by using the kernel DMA APIs. But for these new
>>> use cases, importers only need the pfn of the dma-buf resource to build
>>> their own mapping tables.
>> As far as I can see I have to fundamentally reject this whole approach.
>>
>> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
>> to the importer. Essentially DMA-buf only transports DMA addresses.
>>
>> In other words the mapping is done by the exporter and *not* the importer.
>>
>> What we certainly can do is to annotate those DMA addresses to better
>> specify in which domain they are applicable, e.g. if they are PCIe bus
>> addresses or some inter device bus addresses etc...
>>
>> But moving the functionality to map the pages/PFNs to DMA addresses into the
>> importer is an absolutely clear NO-GO.
> Oh?
>
> Having the importer do the mapping is the correct way to operate the
> DMA API and the new API that Leon has built to fix the scatterlist
> abuse in dmabuf relies on importer mapping as part of its
> construction.
Exactly on that I strongly disagree on.
DMA-buf works by providing DMA addresses the importer can work with and
*NOT* the underlying location of the buffer.
> Why on earth do you want the exporter to map?
Because the exporter owns the exported buffer and only the exporter
knows how to correctly access it.
> That is completely backwards and unworkable in many cases. The dysfunctional P2P support
> in dmabuf is like that principally because of this.
No, that is exactly what we need.
Using the scatterlist to transport the DMA addresses was clearly a
mistake, but having the exporter provide the DMA addresses has proved
many times to be the right approach.
Keep in mind that the exported buffer is not necessarily memory, but can
also be MMIO or stuff which is only accessible through address space
windows, for which you can't create a PFN or struct page.
> That said, I don't think get_pfn() is an especially good interface,
> but we will need to come up with something that passes the physical pfn
> out.
No, a physical pfn is absolutely not a good way of passing the location of
data around because it is limited to what the CPU sees as address space.
We have use cases where DMA-buf transports the location of CPU invisible
data which only the involved devices can see.
Regards,
Christian.
>
> Jason
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Jason Gunthorpe @ 2025-01-08 14:58 UTC (permalink / raw)
To: Christian König
Cc: Christoph Hellwig, Leon Romanovsky, Xu Yilun, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, aik, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
> > Having the importer do the mapping is the correct way to operate the
> > DMA API and the new API that Leon has built to fix the scatterlist
> > abuse in dmabuf relies on importer mapping as part of its
> > construction.
>
> Exactly on that I strongly disagree on.
>
> DMA-buf works by providing DMA addresses the importer can work with and
> *NOT* the underlying location of the buffer.
The expectation is that the DMA API will be used to DMA map (most)
things, and the DMA API always works with a physaddr_t/pfn
argument. Basically, everything that is not a private address space
should be supported by improving the DMA API. We are on course for
finally getting all the common cases like P2P and MMIO solved
here. That alone will take care of a lot.
For P2P cases we are going toward (PFN + P2P source information) as
input to the DMA API. The additional "P2P source information" provides
a good way for co-operating drivers to represent private address
spaces as well. Both importer and exporter can have full understanding
what is being mapped and do the correct things, safely.
So, no, we don't loose private address space support when moving to
importer mapping, in fact it works better because the importer gets
more information about what is going on.
I have imagined a staged approach where DMABUF gets a new API that
works with the new DMA API to do importer mapping with "P2P source
information" and a gradual conversion.
Exporter mapping falls down in too many cases already:
1) Private address spaces don't work fully well because many devices
need some indication what address space is being used and scatter list
can't really properly convey that. If the DMABUF has a mixture of CPU
and private it becomes a PITA
2) Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the
multi-path. We are doing this in mlx5 today and have hacks because
DMABUF is destroying the information the importer needs to choose the
correct PCI path.
3) Importing devices need to know if they are working with PCI P2P
addresses during mapping because they need to do things like turn on
ATS on their DMA. As for multi-path we have the same hacks inside mlx5
today that assume DMABUFs are always P2P because we cannot determine
if things are P2P or not after being DMA mapped.
4) TPH bits need to be programmed into the importer device but are
derived based on the NUMA topology of the DMA target. The importer has
no idea what the DMA target actually was because the exporter mapping
destroyed that information.
5) iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible
Jason
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Christian König @ 2025-01-08 15:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Xu Yilun, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, aik, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On 08.01.25 at 15:58, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
>
>>> Having the importer do the mapping is the correct way to operate the
>>> DMA API and the new API that Leon has built to fix the scatterlist
>>> abuse in dmabuf relies on importer mapping as part of its
>>> construction.
>> Exactly on that I strongly disagree on.
>>
>> DMA-buf works by providing DMA addresses the importer can work with and
>> *NOT* the underlying location of the buffer.
> The expectation is that the DMA API will be used to DMA map (most)
> things, and the DMA API always works with a physaddr_t/pfn
> argument. Basically, everything that is not a private address space
> should be supported by improving the DMA API. We are on course for
> finally getting all the common cases like P2P and MMIO solved
> here. That alone will take care of a lot.
Well, from experience the DMA API has failed more often than it actually
worked in the way required by drivers.
Especially that we tried to hide architectural complexity in there
instead of properly exposing limitations to drivers is not something I
consider a good design approach.
So I see putting even more into it as extremely critical.
> For P2P cases we are going toward (PFN + P2P source information) as
> input to the DMA API. The additional "P2P source information" provides
> a good way for co-operating drivers to represent private address
> spaces as well. Both importer and exporter can have full understanding
> what is being mapped and do the correct things, safely.
I can say from experience that this is clearly not going to work for all
use cases.
It would mean that we have to pull a massive amount of driver specific
functionality into the DMA API.
Things like programming access windows for PCI BARs is completely driver
specific and as far as I can see can't be part of the DMA API without
things like callbacks.
With that in mind the DMA API would become a mid layer between different
drivers, and that is really not something you are suggesting, is it?
> So, no, we don't lose private address space support when moving to
> importer mapping, in fact it works better because the importer gets
> more information about what is going on.
Well, sounds like I wasn't able to voice my concern. Let me try again:
We should not give importers information they don't need. Especially not
information about the backing store of buffers.
So importers getting more information about what's going on is a bad thing.
> I have imagined a staged approach where DMABUF gets a new API that
> works with the new DMA API to do importer mapping with "P2P source
> information" and a gradual conversion.
To make it clear as maintainer of that subsystem I would reject such a
step with all I have.
We have already gone down that road and it didn't work at all and was
a really big pain to pull people back from it.
> Exporter mapping falls down in too many cases already:
>
> 1) Private address spaces don't work fully well because many devices
> need some indication what address space is being used and scatter list
> can't really properly convey that. If the DMABUF has a mixture of CPU
> and private it becomes a PITA
Correct, yes. That's why I said that scatterlist was a bad choice for
the interface.
But exposing the backing store to importers and then letting them do
whatever they want with it sounds like an even worse idea.
> 2) Multi-path PCI can require the importer to make mapping decisions
> unique to the device and program device specific information for the
> multi-path. We are doing this in mlx5 today and have hacks because
> DMABUF is destroying the information the importer needs to choose the
> correct PCI path.
That's why the exporter gets the struct device of the importer so that
it can plan how those accesses are made. Where exactly is the problem
with that?
When you have a use case which is not covered by the existing DMA-buf
interfaces then please voice that to me and other maintainers instead of
implementing some hack.
> 3) Importing devices need to know if they are working with PCI P2P
> addresses during mapping because they need to do things like turn on
> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> today that assume DMABUFs are always P2P because we cannot determine
> if things are P2P or not after being DMA mapped.
Why would you need ATS on PCI P2P and not for system memory accesses?
> 4) TPH bits need to be programmed into the importer device but are
> derived based on the NUMA topology of the DMA target. The importer has
> no idea what the DMA target actually was because the exporter mapping
> destroyed that information.
Yeah, but again that is completely intentional.
I assume you mean TLP processing hints when you say TPH and those should
be part of the DMA addresses provided by the exporter.
That an importer tries to look behind the curtain and determine the
NUMA placement and topology itself is clearly a no-go from the
design perspective.
> 5) iommufd and kvm are both using CPU addresses without DMA. No
> exporter mapping is possible
We have customers using both KVM and XEN with DMA-buf, so I can clearly
confirm that this isn't true.
Regards,
Christian.
>
> Jason
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Jason Gunthorpe @ 2025-01-08 16:22 UTC (permalink / raw)
To: Christian König
Cc: Christoph Hellwig, Leon Romanovsky, Xu Yilun, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, aik, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> On 08.01.25 at 15:58, Jason Gunthorpe wrote:
> > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
> >
> > > > Having the importer do the mapping is the correct way to operate the
> > > > DMA API and the new API that Leon has built to fix the scatterlist
> > > > abuse in dmabuf relies on importer mapping as part of its
> > > > construction.
> > > Exactly on that I strongly disagree on.
> > >
> > > DMA-buf works by providing DMA addresses the importer can work with and
> > > *NOT* the underlying location of the buffer.
> > The expectation is that the DMA API will be used to DMA map (most)
> > things, and the DMA API always works with a physaddr_t/pfn
> > argument. Basically, everything that is not a private address space
> > should be supported by improving the DMA API. We are on course for
> > finally getting all the common cases like P2P and MMIO solved
> > here. That alone will take care of a lot.
>
> Well, from experience the DMA API has failed more often than it actually
> worked in the way required by drivers.
The DMA API has been static and very hard to change in these ways for
a long time. I think Leon's new API will break through this and we
will finally be able to address these issues.
> > For P2P cases we are going toward (PFN + P2P source information) as
> > input to the DMA API. The additional "P2P source information" provides
> > a good way for co-operating drivers to represent private address
> > spaces as well. Both importer and exporter can have full understanding
> > what is being mapped and do the correct things, safely.
>
> I can say from experience that this is clearly not going to work for all use
> cases.
>
> It would mean that we have to pull a massive amount of driver specific
> functionality into the DMA API.
That isn't what I mean. There are two distinct parts, the means to
describe the source (PFN + P2P source information) that is compatible
with the DMA API, and the DMA API itself that works with a few general
P2P source information types.
Private source information would be detected by co-operating drivers
and go down driver private paths. It would be rejected by other
drivers. This broadly follows how the new API is working.
So here I mean you can use the same PFN + Source API between importer
and exporter and the importer can simply detect the special source and
do the private stuff. It is not shifting things under the DMA API, it
is building alongside it using compatible design approaches. You
would match the source information, cast it to a driver structure, do
whatever driver math is needed to compute the local DMA address and
then write it to the device. Nothing is hard or "not going to work"
here.
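As a rough sketch of what I mean (all names here are hypothetical,
nothing like this exists upstream today, and dma_map_resource() merely
stands in for whatever the extended DMA API ends up providing):

/* What the exporter would hand out per range instead of a scatterlist. */
struct phys_vec {
        u64 pfn;
        struct phys_source *src;        /* CPU memory, P2P provider, private */
};

/* Importer-side dispatch in a co-operating driver. */
static dma_addr_t my_addr_of(struct my_dev *mdev, const struct phys_vec *v)
{
        if (v->src->ops == &my_private_source_ops) {
                /* Private address space: driver-specific math, no DMA API. */
                struct my_source *s =
                        container_of(v->src, struct my_source, base);

                return s->window_base +
                       ((u64)(v->pfn - s->first_pfn) << PAGE_SHIFT);
        }

        /* Everything else goes through the (extended) DMA API. */
        return dma_map_resource(mdev->dev, PFN_PHYS(v->pfn), PAGE_SIZE,
                                DMA_BIDIRECTIONAL, 0);
}

Importers that don't recognize a source would simply reject it instead
of silently mis-mapping it.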
> > So, no, we don't lose private address space support when moving to
> > importer mapping, in fact it works better because the importer gets
> > more information about what is going on.
>
> Well, sounds like I wasn't able to voice my concern. Let me try again:
>
> We should not give importers information they don't need. Especially not
> information about the backing store of buffers.
>
> So that importers get more information about what's going on is a bad thing.
I strongly disagree because we are suffering today in mlx5 because of
this viewpoint. You cannot predict in advance what importers are going
to need. I already listed many examples where it does not work today
as is.
> > I have imagined a staged approach where DMABUF gets a new API that
> > works with the new DMA API to do importer mapping with "P2P source
> > information" and a gradual conversion.
>
> To make it clear as maintainer of that subsystem I would reject such a step
> with all I have.
This is unexpected, so you want to just leave dmabuf broken? Do you
have any plan to fix it, to fix the misuse of the DMA API, and all
the problems I listed below? This is a big deal, it is causing real
problems today.
If it is going to be like this I think we will stop trying to use dmabuf
and do something simpler for vfio/kvm/iommufd :(
> We have already gone down that road and it didn't work at all and
> was a really big pain to pull people back from it.
Nobody has really seriously tried to improve the DMA API before, so I
don't think this is true at all.
> > Exporter mapping falls down in too many cases already:
> >
> > 1) Private address spaces don't work fully well because many devices
> > need some indication what address space is being used and scatter list
> > can't really properly convey that. If the DMABUF has a mixture of CPU
> > and private it becomes a PITA
>
> Correct, yes. That's why I said that scatterlist was a bad choice for the
> interface.
>
> But exposing the backing store to importers and then letting them do whatever
> they want with it sounds like an even worse idea.
You keep saying this without real justification. To me it is a nanny
style of API design. But also I don't see how you can possibly fix the
above without telling the importer a lot more information.
> > 2) Multi-path PCI can require the importer to make mapping decisions
> > unique to the device and program device specific information for the
> > multi-path. We are doing this in mlx5 today and have hacks because
> > DMABUF is destroying the information the importer needs to choose the
> > correct PCI path.
>
> That's why the exporter gets the struct device of the importer so that it
> can plan how those accesses are made. Where exactly is the problem with
> that?
A single struct device does not convey the multipath options. We have
multiple struct devices (and multiple PCI endpoints) doing DMA
concurrently under one driver.
Multipath always needs additional meta information in the importer
side to tell the device which path to select. A naked dma address is
not sufficient.
Today we guess that DMABUF will be using P2P and hack to choose a P2P
struct device to pass to the exporter. We need to know what is in the
dmabuf before we can choose which of the multiple struct devices the
driver has to use for DMA mapping.
But even for simple CPU-centric cases we will eventually want to select
the proper NUMA local PCI channel matching struct device for CPU only
buffers.
> When you have a use case which is not covered by the existing DMA-buf
> interfaces then please voice that to me and other maintainers instead of
> implementing some hack.
Do you have any suggestion for any of this then? We have a good plan
to fix this stuff and more. Many experts in their fields have agreed
on the different parts now. We haven't got to dmabuf because I had no
idea there would be an objection like this.
> > 3) Importing devices need to know if they are working with PCI P2P
> > addresses during mapping because they need to do things like turn on
> > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > today that assume DMABUFs are always P2P because we cannot determine
> > if things are P2P or not after being DMA mapped.
>
> Why would you need ATS on PCI P2P and not for system memory accesses?
ATS has a significant performance cost. It is mandatory for PCI P2P,
but ideally should be avoided for CPU memory.
> > 4) TPH bits need to be programmed into the importer device but are
> > derived based on the NUMA topology of the DMA target. The importer has
> > no idea what the DMA target actually was because the exporter mapping
> > destroyed that information.
>
> Yeah, but again that is completely intentional.
>
> I assume you mean TLP processing hints when you say TPH and those should be
> part of the DMA addresses provided by the exporter.
Yes, but it is not part of the DMA addresses.
> That an importer tries to look behind the curtain and determine the NUMA
> placement and topology itself is clearly a no-go from the design
> perspective.
I strongly disagree, this is important. Drivers need this information
in a future TPH/UIO/multipath PCI world.
> > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > exporter mapping is possible
>
> We have customers using both KVM and XEN with DMA-buf, so I can clearly
> confirm that this isn't true.
Today they are mmaping the dma-buf into a VMA and then using KVM's
follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
dma-buf must have a CPU PFN.
Here Xu implements basically the same path, except without the VMA
indirection, and it is suddenly not OK? Illogical.
Jason
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Xu Yilun @ 2025-01-08 17:56 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christian König, Christoph Hellwig, Leon Romanovsky, kvm,
dri-devel, linux-media, linaro-mm-sig, sumit.semwal, pbonzini,
seanjc, alex.williamson, vivek.kasireddy, dan.j.williams, aik,
yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> >
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
>
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.
Yes, the final target for KVM is still the CPU PFN, just with the help
of CPU mapping table.
I also found the xen gntdev-dmabuf is calculating pfn from mapped
sgt.
From Christian's point, I assume only sgl->dma_address should be
used by importers but in fact not. More importers are 'abusing' sg dma
helpers.
That said there are existing needs for importers to know more about the
real buffer resource, for mapping, or even more than mapping,
e.g. dmabuf_imp_grant_foreign_access()
Thanks,
Yilun
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Simona Vetter @ 2025-01-10 19:24 UTC (permalink / raw)
To: Xu Yilun
Cc: Jason Gunthorpe, Christian König, Christoph Hellwig,
Leon Romanovsky, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, aik, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > exporter mapping is possible
> > >
> > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > confirm that this isn't true.
> >
> > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > dma-buf must have a CPU PFN.
>
> Yes, the final target for KVM is still the CPU PFN, just with the help
> of CPU mapping table.
>
> I also found the xen gntdev-dmabuf is calculating pfn from mapped
> sgt.
See the comment, it's ok because it's a fake device with fake iommu and
the xen code has special knowledge to peek behind the curtain.
-Sima
> From Christian's point, I assume only sgl->dma_address should be
> used by importers but in fact not. More importers are 'abusing' sg dma
> helpers.
>
> That said there are existing needs for importers to know more about the
> real buffer resource, for mapping, or even more than mapping,
> e.g. dmabuf_imp_grant_foreign_access()
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Jason Gunthorpe @ 2025-01-10 20:16 UTC (permalink / raw)
To: Xu Yilun, Christian König, Christoph Hellwig,
Leon Romanovsky, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, aik, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, leon, baolu.lu, zhenzhong.duan, tao1.su
On Fri, Jan 10, 2025 at 08:24:22PM +0100, Simona Vetter wrote:
> On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > > exporter mapping is possible
> > > >
> > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > > confirm that this isn't true.
> > >
> > > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > > dma-buf must have a CPU PFN.
> >
> > Yes, the final target for KVM is still the CPU PFN, just with the help
> > of CPU mapping table.
> >
> > I also found the xen gntdev-dmabuf is calculating pfn from mapped
> > sgt.
>
> See the comment, it's ok because it's a fake device with fake iommu and
> the xen code has special knowledge to peek behind the curtain.
/*
* Now convert sgt to array of gfns without accessing underlying pages.
* It is not allowed to access the underlying struct page of an sg table
* exported by DMA-buf, but since we deal with special Xen dma device here
* (not a normal physical one) look at the dma addresses in the sg table
* and then calculate gfns directly from them.
*/
for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));
*barf*
Can we please all agree that this is a horrible abuse of the DMA API and
let's not point to it as some acceptable "solution"? KVM and iommufd do
not have fake struct devices with fake iommus.
Jason
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Simona Vetter @ 2025-01-08 18:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christian König, Christoph Hellwig, Leon Romanovsky,
Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, aik, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > On 08.01.25 at 15:58, Jason Gunthorpe wrote:
> > > I have imagined a staged approach where DMABUF gets a new API that
> > > works with the new DMA API to do importer mapping with "P2P source
> > > information" and a gradual conversion.
> >
> > To make it clear as maintainer of that subsystem I would reject such a step
> > with all I have.
>
> This is unexpected, so you want to just leave dmabuf broken? Do you
> have any plan to fix it, to fix the misuse of the DMA API, and all
> the problems I listed below? This is a big deal, it is causing real
> problems today.
>
> If it is going to be like this I think we will stop trying to use dmabuf
> and do something simpler for vfio/kvm/iommufd :(
As the gal who helped edit the og dma-buf spec 13 years ago, I think adding
pfn isn't a terrible idea. By design, dma-buf is the "everything is
optional" interface. And in the beginning, even consistent locking was
optional, but we've managed to fix that by now :-/
Where I do agree with Christian is that stuffing pfn support into the
dma_buf_attachment interfaces feels a bit wrong.
> > We have already gone down that road and it didn't work at all and
> > was a really big pain to pull people back from it.
>
> Nobody has really seriously tried to improve the DMA API before, so I
> don't think this is true at all.
Aside, I really hope this finally happens!
> > > 3) Importing devices need to know if they are working with PCI P2P
> > > addresses during mapping because they need to do things like turn on
> > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > today that assume DMABUFs are always P2P because we cannot determine
> > > if things are P2P or not after being DMA mapped.
> >
> > Why would you need ATS on PCI P2P and not for system memory accesses?
>
> ATS has a significant performance cost. It is mandatory for PCI P2P,
> but ideally should be avoided for CPU memory.
Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
stuff a bit I guess ...
> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> >
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
>
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.
>
> Here Xu implements basically the same path, except without the VMA
> indirection, and it is suddenly not OK? Illogical.
So the big difference is that for follow_pfn() you need mmu_notifier since
the mmap might move around, whereas with pfn smashed into
dma_buf_attachment you need dma_resv_lock rules, and the move_notify
callback if you go dynamic.
So I guess my first question is, which locking rules do you want here for
pfn importers?
If mmu notifiers are fine, then I think the current approach of follow_pfn
should be ok. But if you instead want dma_resv_lock rules (or the cpu mmap
somehow is an issue itself), then I think the clean design is to create a new
separate access mechanism just for that. It would be the 5th or so (kernel
vmap, userspace mmap, dma_buf_attach and driver private stuff like
virtio_dma_buf.c where you access your buffer with a uuid), so really not
a big deal.
And for non-contrived exporters we might be able to implement the other
access methods in terms of the pfn method generically, so this wouldn't
even be a terrible maintenance burden going forward. And meanwhile all the
contrived exporters just keep working as-is.
The other part is that cpu mmap is optional, and there's plenty of strange
exporters who don't implement it. But you can dma map the attachment into
plenty of devices. This tends to mostly be a thing on SoC devices with some
very funky memory. But I guess you don't care about these use-cases, so
this should be ok.
I couldn't come up with a good name for these pfn users, maybe
dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
some of these new p2p source specifiers (or a list of those which are
allowed, no idea how this would need to fit into the new dma api).
Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Xu Yilun @ 2025-01-08 19:22 UTC (permalink / raw)
To: Jason Gunthorpe, Christian König, Christoph Hellwig,
Leon Romanovsky, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, aik, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, leon, baolu.lu, zhenzhong.duan, tao1.su
On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > > On 08.01.25 at 15:58, Jason Gunthorpe wrote:
> > > > I have imagined a staged approach where DMABUF gets a new API that
> > > > works with the new DMA API to do importer mapping with "P2P source
> > > > information" and a gradual conversion.
> > >
> > > To make it clear as maintainer of that subsystem I would reject such a step
> > > with all I have.
> >
> > This is unexpected, so you want to just leave dmabuf broken? Do you
> > have any plan to fix it, to fix the misuse of the DMA API, and all
> > the problems I listed below? This is a big deal, it is causing real
> > problems today.
> >
> > If it is going to be like this I think we will stop trying to use dmabuf
> > and do something simpler for vfio/kvm/iommufd :(
>
> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
> pfn isn't a terrible idea. By design, dma-buf is the "everything is
> optional" interface. And in the beginning, even consistent locking was
> optional, but we've managed to fix that by now :-/
>
> Where I do agree with Christian is that stuffing pfn support into the
> dma_buf_attachment interfaces feels a bit much wrong.
So it could be a dmabuf interface like mmap/vmap()? I was also wondering
about that. But I finally chose the dma_buf_attachment interface
to leverage the existing buffer pinning and move_notify.
>
> > > We have already gone down that road and it didn't work at all and
> > > was a really big pain to pull people back from it.
> >
> > Nobody has really seriously tried to improve the DMA API before, so I
> > don't think this is true at all.
>
> Aside, I really hope this finally happens!
>
> > > > 3) Importing devices need to know if they are working with PCI P2P
> > > > addresses during mapping because they need to do things like turn on
> > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > > today that assume DMABUFs are always P2P because we cannot determine
> > > > if things are P2P or not after being DMA mapped.
> > >
> > > Why would you need ATS on PCI P2P and not for system memory accesses?
> >
> > ATS has a significant performance cost. It is mandatory for PCI P2P,
> > but ideally should be avoided for CPU memory.
>
> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
> stuff a bit I guess ...
>
> > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > exporter mapping is possible
> > >
> > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > confirm that this isn't true.
> >
> > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > dma-buf must have a CPU PFN.
> >
> > Here Xu implements basically the same path, except without the VMA
> > indirection, and it is suddenly not OK? Illogical.
>
> So the big difference is that for follow_pfn() you need mmu_notifier since
> the mmap might move around, whereas with pfn smashed into
> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
> callback if you go dynamic.
>
> So I guess my first question is, which locking rules do you want here for
> pfn importers?
follow_pfn() is unwanted for private MMIO, so dma_resv_lock rules.
>
> If mmu notifiers are fine, then I think the current approach of follow_pfn
> should be ok. But if you instead want dma_resv_lock rules (or the cpu mmap
> somehow is an issue itself), then I think the clean design is to create a new
cpu mmap() is an issue; this series aims to eliminate userspace mappings
for private MMIO resources.
> separate access mechanism just for that. It would be the 5th or so (kernel
> vmap, userspace mmap, dma_buf_attach and driver private stuff like
> virtio_dma_buf.c where you access your buffer with a uuid), so really not
> a big deal.
OK, will think more about that.
Thanks,
Yilun
>
> And for non-contrived exporters we might be able to implement the other
> access methods in terms of the pfn method generically, so this wouldn't
> even be a terrible maintenance burden going forward. And meanwhile all the
> contrived exporters just keep working as-is.
>
> The other part is that cpu mmap is optional, and there's plenty of strange
> exporters who don't implement. But you can dma map the attachment into
> plenty of devices. This tends to mostly be a thing on SoC devices with some
> very funky memory. But I guess you don't care about these use-cases, so
> should be ok.
>
> I couldn't come up with a good name for these pfn users, maybe
> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
> has some of these new p2p source specifiers (or a list of those which are
> allowed, no idea how this would need to fit into the new dma api).
>
> Cheers, Sima
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 134+ messages in thread
[parent not found: <0e7f92bd-7da3-4328-9081-0957b3d155ca@amd.com>]
* Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
[not found] ` <0e7f92bd-7da3-4328-9081-0957b3d155ca@amd.com>
@ 2025-01-09 9:28 ` Leon Romanovsky
0 siblings, 0 replies; 134+ messages in thread
From: Leon Romanovsky @ 2025-01-09 9:28 UTC (permalink / raw)
To: Christian König
Cc: Jason Gunthorpe, Christoph Hellwig, Xu Yilun, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, aik, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter,
baolu.lu, zhenzhong.duan, tao1.su
On Thu, Jan 09, 2025 at 10:10:01AM +0100, Christian König wrote:
> Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:
> > [SNIP]
> > > > For P2P cases we are going toward (PFN + P2P source information) as
> > > > input to the DMA API. The additional "P2P source information" provides
> > > > a good way for co-operating drivers to represent private address
> > > > spaces as well. Both importer and exporter can have full understanding
> > > > what is being mapped and do the correct things, safely.
> > > I can say from experience that this is clearly not going to work for all use
> > > cases.
> > >
> > > It would mean that we have to pull a massive amount of driver specific
> > > functionality into the DMA API.
> > That isn't what I mean. There are two distinct parts, the means to
> > describe the source (PFN + P2P source information) that is compatible
> > with the DMA API, and the DMA API itself that works with a few general
> > P2P source information types.
> >
> > Private source information would be detected by co-operating drivers
> > and go down driver private paths. It would be rejected by other
> > drivers. This broadly follows how the new API is working.
> >
> > So here I mean you can use the same PFN + Source API between importer
> > and exporter and the importer can simply detect the special source and
> > do the private stuff. It is not shifting things under the DMA API, it
> > is building along side it using compatible design approaches. You
> > would match the source information, cast it to a driver structure, do
> > whatever driver math is needed to compute the local DMA address and
> > then write it to the device. Nothing is hard or "not going to work"
> > here.
>
> Well to be honest that sounds like an absolutely horrible design.
>
> You are moving all responsibilities for inter driver handling into the
> drivers themselves without any supervision by the core OS.
>
> Drivers are notoriously buggy and should absolutely not do things like that
> on their own.
IMHO, you and Jason give different meanings to the word "driver" in this
discussion. It is up to the subsystems to decide how to provide the new
API to the end drivers. It is worth reading this LWN article first:
Dancing the DMA two-step - https://lwn.net/Articles/997563/
>
> Do you have pointers to this new API?
Latest version is here - https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org/
Unfortunately, I forgot to copy/paste the cover letter, but it can be seen in
the previous version: https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org/.
The most complex example is the block layer implementation, which hides the
DMA API from block drivers: https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/
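For reference, the core of the two-step flow in that series looks
roughly like below (a sketch only; the names are as posted there and may
still change, error handling elided):

        struct dma_iova_state state = {};

        /* Step 1: allocate IOVA space once for the whole buffer */
        if (dma_iova_try_alloc(dev, &state, phys, size)) {
                /* Step 2: link physical ranges, then flush the IOTLB */
                dma_iova_link(dev, &state, phys, 0, size,
                              DMA_TO_DEVICE, 0);
                dma_iova_sync(dev, &state, 0, size);
        } else {
                /* No usable IOMMU, fall back to dma_map_page() & friends */
        }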
Thanks
^ permalink raw reply [flat|nested] 134+ messages in thread
* [RFC PATCH 02/12] vfio: Export vfio device get and put registration helpers
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 03/12] vfio/pci: Share the core device pointer while invoking feature functions Xu Yilun
` (10 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
From: Vivek Kasireddy <vivek.kasireddy@intel.com>
These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.
Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
drivers/vfio/vfio_main.c | 2 ++
include/linux/vfio.h | 2 ++
2 files changed, 4 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582..620a3ee5d04d 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -171,11 +171,13 @@ void vfio_device_put_registration(struct vfio_device *device)
if (refcount_dec_and_test(&device->refcount))
complete(&device->comp);
}
+EXPORT_SYMBOL_GPL(vfio_device_put_registration);
bool vfio_device_try_get_registration(struct vfio_device *device)
{
return refcount_inc_not_zero(&device->refcount);
}
+EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
/*
* VFIO driver API
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 000a6cab2d31..2258b0585330 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -279,6 +279,8 @@ static inline void vfio_put_device(struct vfio_device *device)
int vfio_register_group_dev(struct vfio_device *device);
int vfio_register_emulated_iommu_dev(struct vfio_device *device);
void vfio_unregister_group_dev(struct vfio_device *device);
+bool vfio_device_try_get_registration(struct vfio_device *device);
+void vfio_device_put_registration(struct vfio_device *device);
int vfio_assign_device_set(struct vfio_device *device, void *set_id);
unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 03/12] vfio/pci: Share the core device pointer while invoking feature functions
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 02/12] vfio: Export vfio device get and put registration helpers Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf Xu Yilun
` (9 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
From: Vivek Kasireddy <vivek.kasireddy@intel.com>
There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
1 file changed, 13 insertions(+), 17 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1ab58da9f38a..c3269d708411 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -300,11 +300,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
return 0;
}
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -321,12 +319,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
}
static int vfio_pci_core_pm_entry_with_wakeup(
- struct vfio_device *device, u32 flags,
+ struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_low_power_entry_with_wakeup __user *arg,
size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
struct vfio_device_low_power_entry_with_wakeup entry;
struct eventfd_ctx *efdctx;
int ret;
@@ -377,11 +373,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
up_write(&vdev->memory_lock);
}
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -1486,11 +1480,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
- uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+ u32 flags, uuid_t __user *arg,
+ size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
uuid_t uuid;
int ret;
@@ -1517,16 +1510,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
+ struct vfio_pci_core_device *vdev =
+ container_of(device, struct vfio_pci_core_device, vdev);
+
switch (flags & VFIO_DEVICE_FEATURE_MASK) {
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
- return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+ return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
- return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+ return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
- return vfio_pci_core_pm_exit(device, flags, arg, argsz);
+ return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
- return vfio_pci_core_feature_token(device, flags, arg, argsz);
+ return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (2 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 03/12] vfio/pci: Share the core device pointer while invoking feature functions Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 05/12] vfio/pci: Support get_pfn() callback for dma-buf Xu Yilun
` (8 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
From: Vivek Kasireddy <vivek.kasireddy@intel.com>
This is a reduced version of Vivek's series [1]. The
dma_buf_ops.attach/map/unmap_dma_buf/mmap() callbacks are removed as they are
not necessary in this series, and because of the WIP p2p dma mapping open
issues [2]. Just focus on the private MMIO get_pfn() functionality at this
early stage.
From Jason Gunthorpe:
"dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.
The patch design loosely follows the pattern in commit
db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
does not support pinning.
Instead, this implements what, in the past, we've called a revocable
attachment using move. In normal situations the attachment is pinned, as a
BAR does not change physical address. However when the VFIO device is
closed, or a PCI reset is issued, access to the MMIO memory is revoked.
Revoked means that move occurs, but an attempt to immediately re-map the
memory will fail. In the reset case a future move will be triggered when
MMIO access returns. As both close and reset are under userspace control
it is expected that userspace will suspend use of the dma-buf before doing
these operations, the revoke is purely for kernel self-defense against a
hostile userspace."
[1] https://lore.kernel.org/kvm/20240624065552.1572580-4-vivek.kasireddy@intel.com/
[2] https://lore.kernel.org/all/IA0PR11MB7185FDD56CFDD0A2B8D21468F83B2@IA0PR11MB7185.namprd11.prod.outlook.com/
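A hypothetical userspace sketch of the uAPI added below (error handling
and includes elided; device_fd is a placeholder, the whole of BAR 0 is
exported):

        __u8 buf[sizeof(struct vfio_device_feature) +
                 sizeof(struct vfio_device_feature_dma_buf) +
                 sizeof(struct vfio_region_dma_range)] = {};
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_dma_buf *get_dma_buf = (void *)feature->data;
        int dmabuf_fd;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
        get_dma_buf->open_flags = O_RDWR | O_CLOEXEC;
        get_dma_buf->nr_ranges = 1;
        get_dma_buf->dma_ranges[0] = (struct vfio_region_dma_range) {
                .region_index = VFIO_PCI_BAR0_REGION_INDEX,
                .offset = 0,
                .length = 0,    /* offset == length == 0 selects the whole BAR */
        };

        /* Returns a new dmabuf fd on success */
        dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);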
Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/dma_buf.c | 223 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_config.c | 22 ++-
drivers/vfio/pci/vfio_pci_core.c | 20 ++-
drivers/vfio/pci/vfio_pci_priv.h | 25 ++++
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 29 ++++
7 files changed, 316 insertions(+), 5 deletions(-)
create mode 100644 drivers/vfio/pci/dma_buf.c
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..0cfdc9ede82f 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
new file mode 100644
index 000000000000..1d5f46744922
--- /dev/null
+++ b/drivers/vfio/pci/dma_buf.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS("DMA_BUF");
+
+struct vfio_pci_dma_buf {
+ struct dma_buf *dmabuf;
+ struct vfio_pci_core_device *vdev;
+ struct list_head dmabufs_elm;
+ unsigned int nr_ranges;
+ struct vfio_region_dma_range *dma_ranges;
+ bool revoked;
+};
+
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
+{
+}
+
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
+{
+ /*
+ * Uses the dynamic interface but must always allow for
+ * dma_buf_move_notify() to do revoke
+ */
+ return -EINVAL;
+}
+
+static int vfio_pci_dma_buf_get_pfn(struct dma_buf_attachment *attachment,
+ pgoff_t pgoff, u64 *pfn, int *max_order)
+{
+ /* TODO */
+ return -EOPNOTSUPP;
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ /*
+ * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+ * The refcount prevents both.
+ */
+ if (priv->vdev) {
+ down_write(&priv->vdev->memory_lock);
+ list_del_init(&priv->dmabufs_elm);
+ up_write(&priv->vdev->memory_lock);
+ vfio_device_put_registration(&priv->vdev->vdev);
+ }
+ kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+ .pin = vfio_pci_dma_buf_pin,
+ .unpin = vfio_pci_dma_buf_unpin,
+ .get_pfn = vfio_pci_dma_buf_get_pfn,
+ .release = vfio_pci_dma_buf_release,
+};
+
+static int check_dma_ranges(struct vfio_pci_dma_buf *priv, u64 *dmabuf_size)
+{
+ struct vfio_region_dma_range *dma_ranges = priv->dma_ranges;
+ struct pci_dev *pdev = priv->vdev->pdev;
+ resource_size_t bar_size;
+ int i;
+
+ for (i = 0; i < priv->nr_ranges; i++) {
+ /*
+ * For PCI the region_index is the BAR number like
+ * everything else.
+ */
+ if (dma_ranges[i].region_index >= VFIO_PCI_ROM_REGION_INDEX)
+ return -EINVAL;
+
+ bar_size = pci_resource_len(pdev, dma_ranges[i].region_index);
+ if (!bar_size)
+ return -EINVAL;
+
+ if (!dma_ranges[i].offset && !dma_ranges[i].length)
+ dma_ranges[i].length = bar_size;
+
+ if (!IS_ALIGNED(dma_ranges[i].offset, PAGE_SIZE) ||
+ !IS_ALIGNED(dma_ranges[i].length, PAGE_SIZE) ||
+ dma_ranges[i].length > bar_size ||
+ dma_ranges[i].offset >= bar_size ||
+ dma_ranges[i].offset + dma_ranges[i].length > bar_size)
+ return -EINVAL;
+
+ *dmabuf_size += dma_ranges[i].length;
+ }
+
+ return 0;
+}
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf __user *arg,
+ size_t argsz)
+{
+ struct vfio_device_feature_dma_buf get_dma_buf;
+ struct vfio_region_dma_range *dma_ranges;
+ DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+ struct vfio_pci_dma_buf *priv;
+ u64 dmabuf_size = 0;
+ int ret;
+
+ ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+ sizeof(get_dma_buf));
+ if (ret != 1)
+ return ret;
+
+ if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+ return -EFAULT;
+
+ dma_ranges = memdup_array_user(&arg->dma_ranges,
+ get_dma_buf.nr_ranges,
+ sizeof(*dma_ranges));
+ if (IS_ERR(dma_ranges))
+ return PTR_ERR(dma_ranges);
+
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv) {
+ kfree(dma_ranges);
+ return -ENOMEM;
+ }
+
+ priv->vdev = vdev;
+ priv->nr_ranges = get_dma_buf.nr_ranges;
+ priv->dma_ranges = dma_ranges;
+
+ ret = check_dma_ranges(priv, &dmabuf_size);
+ if (ret)
+ goto err_free_priv;
+
+ if (!vfio_device_try_get_registration(&vdev->vdev)) {
+ ret = -ENODEV;
+ goto err_free_priv;
+ }
+
+ exp_info.ops = &vfio_pci_dmabuf_ops;
+ exp_info.size = dmabuf_size;
+ exp_info.flags = get_dma_buf.open_flags;
+ exp_info.priv = priv;
+
+ priv->dmabuf = dma_buf_export(&exp_info);
+ if (IS_ERR(priv->dmabuf)) {
+ ret = PTR_ERR(priv->dmabuf);
+ goto err_dev_put;
+ }
+
+ /* dma_buf_put() now frees priv */
+ INIT_LIST_HEAD(&priv->dmabufs_elm);
+ down_write(&vdev->memory_lock);
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ priv->revoked = !__vfio_pci_memory_enabled(vdev);
+ list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+ dma_resv_unlock(priv->dmabuf->resv);
+ up_write(&vdev->memory_lock);
+
+ /*
+ * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+ * will be released.
+ */
+ return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+
+err_dev_put:
+ vfio_device_put_registration(&vdev->vdev);
+err_free_priv:
+ kfree(dma_ranges);
+ kfree(priv);
+ return ret;
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+ struct vfio_pci_dma_buf *priv;
+ struct vfio_pci_dma_buf *tmp;
+
+ lockdep_assert_held_write(&vdev->memory_lock);
+
+ list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+ /*
+ * Returns true if a reference was successfully obtained.
+ * The caller must interlock with the dmabuf's release
+ * function in some way, such as RCU, to ensure that this
+ * is not called on freed memory.
+ */
+ if (!get_file_rcu(&priv->dmabuf->file))
+ continue;
+
+ if (priv->revoked != revoked) {
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ priv->revoked = revoked;
+ dma_buf_move_notify(priv->dmabuf);
+ dma_resv_unlock(priv->dmabuf->resv);
+ }
+ dma_buf_put(priv->dmabuf);
+ }
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_dma_buf *priv;
+ struct vfio_pci_dma_buf *tmp;
+
+ down_write(&vdev->memory_lock);
+ list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+ if (!get_file_rcu(&priv->dmabuf->file))
+ continue;
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ list_del_init(&priv->dmabufs_elm);
+ priv->vdev = NULL;
+ priv->revoked = true;
+ dma_buf_move_notify(priv->dmabuf);
+ dma_resv_unlock(priv->dmabuf->resv);
+ vfio_device_put_registration(&vdev->vdev);
+ dma_buf_put(priv->dmabuf);
+ }
+ up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index ea2745c1ac5e..5cc200e15edc 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem)
+ if (!new_mem) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
- else
+ vfio_pci_dma_buf_move(vdev, true);
+ } else {
down_write(&vdev->memory_lock);
+ }
/*
* If the user is writing mem/io enable (new_mem/io) and we
@@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
*virt_cmd &= cpu_to_le16(~mask);
*virt_cmd |= cpu_to_le16(new_cmd & mask);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
pci_power_t state)
{
- if (state >= PCI_D3hot)
+ if (state >= PCI_D3hot) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
- else
+ vfio_pci_dma_buf_move(vdev, true);
+ } else {
down_write(&vdev->memory_lock);
+ }
vfio_pci_set_power_state(vdev, state);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}
@@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c3269d708411..f69eda5956ad 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -287,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
* semaphore.
*/
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ vfio_pci_dma_buf_move(vdev, true);
+
if (vdev->pm_runtime_engaged) {
up_write(&vdev->memory_lock);
return -EINVAL;
@@ -370,6 +372,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
*/
down_write(&vdev->memory_lock);
__vfio_pci_runtime_pm_exit(vdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -690,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
#endif
vfio_pci_core_disable(vdev);
+ vfio_pci_dma_buf_cleanup(vdev);
+
mutex_lock(&vdev->igate);
if (vdev->err_trigger) {
eventfd_ctx_put(vdev->err_trigger);
@@ -1234,7 +1240,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
*/
vfio_pci_set_power_state(vdev, PCI_D0);
+ vfio_pci_dma_buf_move(vdev, true);
ret = pci_try_reset_function(vdev->pdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
return ret;
@@ -1523,6 +1532,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_DMA_BUF:
+ return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
@@ -2098,6 +2109,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
INIT_LIST_HEAD(&vdev->dummy_resources_list);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
INIT_LIST_HEAD(&vdev->sriov_pfs_item);
+ INIT_LIST_HEAD(&vdev->dmabufs);
init_rwsem(&vdev->memory_lock);
xa_init(&vdev->ctx);
@@ -2480,11 +2492,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
* cause the PCI config space reset without restoring the original
* state (saved locally in 'vdev->pm_save').
*/
- list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+ list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
+ vfio_pci_dma_buf_move(vdev, true);
vfio_pci_set_power_state(vdev, PCI_D0);
+ }
ret = pci_reset_bus(pdev);
+ list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
+
vdev = list_last_entry(&dev_set->device_list,
struct vfio_pci_core_device, vdev.dev_set_list);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5e4fa69aee16..d27f383f3931 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -101,4 +101,29 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}
+#ifdef CONFIG_DMA_SHARED_BUFFER
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf __user *arg,
+ size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static inline int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf __user *arg,
+ size_t argsz)
+{
+ return -ENOTTY;
+}
+
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+ bool revoked)
+{
+}
+#endif
+
#endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b3..da5d8955ae56 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -94,6 +94,7 @@ struct vfio_pci_core_device {
struct vfio_pci_core_device *sriov_pf_core_dev;
struct notifier_block nb;
struct rw_semaphore memory_lock;
+ struct list_head dmabufs;
};
/* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c8dbf8219c4f..f43dfbde7352 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1458,6 +1458,35 @@ struct vfio_device_feature_bus_master {
};
#define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * regions selected.
+ *
+ * For struct vfio_device_feature_dma_buf, open_flags are the typical
+ * flags passed to open(2), e.g. O_RDWR, O_CLOEXEC, etc. nr_ranges is the
+ * total number of dma_ranges that comprise the dmabuf.
+ *
+ * For struct vfio_region_dma_range, region_index/offset/length specify a slice
+ * of the region to create the dmabuf from; if both offset & length are 0 then
+ * the whole region is used.
+ *
+ * Return: The fd number on success, -1 with errno set on failure.
+ */
+struct vfio_region_dma_range {
+ __u32 region_index;
+ __u32 __pad;
+ __u64 offset;
+ __u64 length;
+};
+
+struct vfio_device_feature_dma_buf {
+ __u32 open_flags;
+ __u32 nr_ranges;
+ struct vfio_region_dma_range dma_ranges[];
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 05/12] vfio/pci: Support get_pfn() callback for dma-buf
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (3 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 06/12] KVM: Support vfio_dmabuf backed MMIO region Xu Yilun
` (7 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Implement the get_pfn() callback for exported MMIO resources.
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
drivers/vfio/pci/dma_buf.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
index 1d5f46744922..ad12cfb85099 100644
--- a/drivers/vfio/pci/dma_buf.c
+++ b/drivers/vfio/pci/dma_buf.c
@@ -33,8 +33,34 @@ static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
static int vfio_pci_dma_buf_get_pfn(struct dma_buf_attachment *attachment,
pgoff_t pgoff, u64 *pfn, int *max_order)
{
- /* TODO */
- return -EOPNOTSUPP;
+ struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+ struct vfio_region_dma_range *dma_ranges = priv->dma_ranges;
+ u64 offset = pgoff << PAGE_SHIFT;
+ int i;
+
+ dma_resv_assert_held(priv->dmabuf->resv);
+
+ if (priv->revoked)
+ return -ENODEV;
+
+ if (offset >= priv->dmabuf->size)
+ return -EINVAL;
+
+ for (i = 0; i < priv->nr_ranges; i++) {
+ if (offset < dma_ranges[i].length)
+ break;
+
+ offset -= dma_ranges[i].length;
+ }
+
+ *pfn = PHYS_PFN(pci_resource_start(priv->vdev->pdev, dma_ranges[i].region_index) +
+ dma_ranges[i].offset + offset);
+
+ /* TODO: large page mapping is yet to be supported */
+ if (max_order)
+ *max_order = 0;
+
+ return 0;
}
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 06/12] KVM: Support vfio_dmabuf backed MMIO region
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (4 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 05/12] vfio/pci: Support get_pfn() callback for dma-buf Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 07/12] KVM: x86/mmu: Handle page fault for vfio_dmabuf backed MMIO Xu Yilun
` (6 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Extend KVM_SET_USER_MEMORY_REGION2 to support mapping vfio_dmabuf
backed MMIO regions into a guest.
The main purpose of this change is for KVM to map MMIO resources
without first mapping them into the host, similar to what is done in
guest_memfd. The immediate use case is for CoCo VMs to support private
MMIO.
Similar to private guest memory, private MMIO is not intended to be
accessed by the host. Host access to private MMIO would be rejected by
private devices (known as TDIs in the TDISP spec) and cause the TDI to
exit the secure state. The further impact on the system may vary
according to device implementation. The TDISP spec doesn't mandate any
error reporting or logging; the TLP may be handled as an Unsupported
Request, or just be dropped. In my test environment, an AER NonFatalErr
is reported with no further impact. So from the HW perspective,
disallowing host access to private MMIO is not that critical, but nice
to have.
But sticking to finding the pfn via a userspace mapping while allowing
the pfn to be privately mapped conflicts with the private mapping
concept. It also virtually allows userspace to map any address as
private. Before fault-in, KVM cannot distinguish whether a userspace
addr is for private MMIO and safe for host access.
Relying on a userspace mapping also means the private MMIO mapping
should follow userspace mapping changes via mmu_notifier. This
conflicts with the current design that mmu_notifier never impacts
private mappings. It also makes no sense to support mmu_notifier just
for private MMIO: the private MMIO mapping should be fixed once the
CoCo-VM accepts the private MMIO, and any subsequent mapping change
without guest permission should be invalid.
So the choice here is to eliminate the userspace mapping and switch to
FD based MMIO resources.
There is still a need to switch the memory attribute (shared <->
private) for private MMIO when the guest switches the device attribute
between shared & private. Unlike memory, an MMIO region has only one
physical backend, so it is a bit like an in-place conversion, which for
private memory requires much effort on how to invalidate the user
mapping when converting to private. But for MMIO, it is expected that
the VMM never needs to access assigned MMIO for feature emulation, so
always disallow userspace MMIO mapping and use FD based MMIO resources
for 'private capable' MMIO regions.
dma-buf is chosen as the FD based backend; it meets the need for KVM to
acquire non-struct page memory that can still have its lifetime
controlled by VFIO. It provides the option to disallow userspace mmap
as long as the exporter doesn't provide the dma_buf_ops.mmap()
callback. The concern is that it currently only supports mapping into a
device's default_domain via the DMA APIs. There are some clues for
extending the dma-buf APIs to subsystems like IOMMUFD [1] or KVM; the
addition of dma_buf_get_pfn_unlocked() in this series is for this
purpose.
An alternative is for VFIO to provide a dedicated FD for KVM. But
considering that IOMMUFD may use dma-buf for MMIO mapping [2], it is
better to have a unified export mechanism in VFIO for the same purpose.
Open: The dmabuf fd parameter is currently stored in
kvm_userspace_memory_region2::guest_memfd. It may be confusing, but it
avoids introducing another API format for
IOCTL(KVM_SET_USER_MEMORY_REGION3).
[1] https://lore.kernel.org/all/YwywgciH6BiWz4H1@nvidia.com/
[2] https://lore.kernel.org/kvm/14-v4-0de2f6c78ed0+9d1-iommufd_jgg@nvidia.com/
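A hypothetical VMM-side sketch (slot_id, gpa, dmabuf_size and dmabuf_fd
are placeholders; the dmabuf fd comes from VFIO_DEVICE_FEATURE_DMA_BUF):

        struct kvm_userspace_memory_region2 region = {
                .slot = slot_id,
                .flags = KVM_MEM_VFIO_DMABUF,
                .guest_phys_addr = gpa,
                .memory_size = dmabuf_size,
                /* Open above: the dmabuf fd reuses the guest_memfd field */
                .guest_memfd = dmabuf_fd,
                .guest_memfd_offset = 0,        /* must be 0 */
        };

        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);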
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
Documentation/virt/kvm/api.rst | 7 ++
include/linux/kvm_host.h | 18 +++++
include/uapi/linux/kvm.h | 1 +
virt/kvm/Kconfig | 6 ++
virt/kvm/Makefile.kvm | 1 +
virt/kvm/kvm_main.c | 32 +++++++--
virt/kvm/kvm_mm.h | 19 +++++
virt/kvm/vfio_dmabuf.c | 125 +++++++++++++++++++++++++++++++++
8 files changed, 205 insertions(+), 4 deletions(-)
create mode 100644 virt/kvm/vfio_dmabuf.c
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7911da34b9fd..f6199764a768 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6304,6 +6304,13 @@ state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
is '0' for all gfns. Userspace can control whether memory is shared/private by
toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
+Userspace can set KVM_MEM_VFIO_DMABUF in flags to indicate that the memory
+region is backed by a userspace-unmappable dma_buf exported by VFIO. The
+backing resource is one piece of an MMIO region of the device. The slot is
+unmappable, so it is allowed to be converted to private. KVM binds the memory
+region to a given dma_buf fd range of [0, memory_size]. For now, the dma_buf
+fd is filled in the 'guest_memfd' field, and guest_memfd_offset must be 0.
+
S390:
^^^^^
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0a141685872d..871d927485a5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -606,6 +606,10 @@ struct kvm_memory_slot {
pgoff_t pgoff;
} gmem;
#endif
+
+#ifdef CONFIG_KVM_VFIO_DMABUF
+ struct dma_buf_attachment *dmabuf_attach;
+#endif
};
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
@@ -2568,4 +2572,18 @@ static inline int kvm_enable_virtualization(void) { return 0; }
static inline void kvm_disable_virtualization(void) { }
#endif
+#ifdef CONFIG_KVM_VFIO_DMABUF
+int kvm_vfio_dmabuf_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+#else
+static inline int kvm_vfio_dmabuf_get_pfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn,
+ int *max_order)
+{
+ KVM_BUG_ON(1, kvm);
+ return -EIO;
+}
+#endif
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1dae36cbfd52..4f5b5def182a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
+#define KVM_MEM_VFIO_DMABUF (1UL << 3)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..68fff3fb1841 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -115,6 +115,7 @@ config KVM_PRIVATE_MEM
config KVM_GENERIC_PRIVATE_MEM
select KVM_GENERIC_MEMORY_ATTRIBUTES
select KVM_PRIVATE_MEM
+ select KVM_VFIO_DMABUF
bool
config HAVE_KVM_ARCH_GMEM_PREPARE
@@ -124,3 +125,8 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
config HAVE_KVM_ARCH_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config KVM_VFIO_DMABUF
+ bool
+ select DMA_SHARED_BUFFER
+ select DMABUF_MOVE_NOTIFY
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 724c89af78af..c08e98f13f65 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -13,3 +13,4 @@ kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
+kvm-$(CONFIG_KVM_VFIO_DMABUF) += $(KVM)/vfio_dmabuf.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4a13de82479d..c9342d88f06c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -938,6 +938,8 @@ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
if (slot->flags & KVM_MEM_GUEST_MEMFD)
kvm_gmem_unbind(slot);
+ else if (slot->flags & KVM_MEM_VFIO_DMABUF)
+ kvm_vfio_dmabuf_unbind(slot);
kvm_destroy_dirty_bitmap(slot);
@@ -1526,13 +1528,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region2 *mem)
{
+ u32 private_mask = KVM_MEM_GUEST_MEMFD | KVM_MEM_VFIO_DMABUF;
+ u32 private_flag = mem->flags & private_mask;
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+ /* private flags are mutually exclusive. */
+ if (private_flag & (private_flag - 1))
+ return -EINVAL;
+
if (kvm_arch_has_private_mem(kvm))
- valid_flags |= KVM_MEM_GUEST_MEMFD;
+ valid_flags |= private_flag;
/* Dirty logging private memory is not currently supported. */
- if (mem->flags & KVM_MEM_GUEST_MEMFD)
+ if (private_flag)
valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
/*
@@ -1540,8 +1548,7 @@ static int check_memory_region_flags(struct kvm *kvm,
* read-only memslots have emulated MMIO, not page fault, semantics,
* and KVM doesn't allow emulated MMIO for private memory.
*/
- if (kvm_arch_has_readonly_mem(kvm) &&
- !(mem->flags & KVM_MEM_GUEST_MEMFD))
+ if (kvm_arch_has_readonly_mem(kvm) && !private_flag)
valid_flags |= KVM_MEM_READONLY;
if (mem->flags & ~valid_flags)
@@ -2044,6 +2051,21 @@ int __kvm_set_memory_region(struct kvm *kvm,
r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
if (r)
goto out;
+ } else if (mem->flags & KVM_MEM_VFIO_DMABUF) {
+ if (mem->guest_memfd_offset) {
+ r = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Open: May be confusing that store the dmabuf fd parameter in
+ * kvm_userspace_memory_region2::guest_memfd. But this avoids
+ * introducing another format for
+ * IOCTL(KVM_SET_USER_MEMORY_REGIONX).
+ */
+ r = kvm_vfio_dmabuf_bind(kvm, new, mem->guest_memfd);
+ if (r)
+ goto out;
}
r = kvm_set_memslot(kvm, old, new, change);
@@ -2055,6 +2077,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
out_unbind:
if (mem->flags & KVM_MEM_GUEST_MEMFD)
kvm_gmem_unbind(new);
+ else if (mem->flags & KVM_MEM_VFIO_DMABUF)
+ kvm_vfio_dmabuf_unbind(new);
out:
kfree(new);
return r;
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..faefc252c337 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -93,4 +93,23 @@ static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot)
}
#endif /* CONFIG_KVM_PRIVATE_MEM */
+#ifdef CONFIG_KVM_VFIO_DMABUF
+int kvm_vfio_dmabuf_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+ unsigned int fd);
+void kvm_vfio_dmabuf_unbind(struct kvm_memory_slot *slot);
+#else
+static inline int kvm_vfio_dmabuf_bind(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned int fd)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
+
+static inline void kvm_vfio_dmabuf_unbind(struct kvm_memory_slot *slot)
+{
+ WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_KVM_VFIO_DMABUF */
+
#endif /* __KVM_MM_H__ */
diff --git a/virt/kvm/vfio_dmabuf.c b/virt/kvm/vfio_dmabuf.c
new file mode 100644
index 000000000000..c427ab39c68a
--- /dev/null
+++ b/virt/kvm/vfio_dmabuf.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/dma-buf.h>
+#include <linux/kvm_host.h>
+#include <linux/vfio.h>
+
+#include "kvm_mm.h"
+
+MODULE_IMPORT_NS("DMA_BUF");
+
+struct kvm_vfio_dmabuf {
+ struct kvm *kvm;
+ struct kvm_memory_slot *slot;
+};
+
+static void kv_dmabuf_move_notify(struct dma_buf_attachment *attach)
+{
+ struct kvm_vfio_dmabuf *kv_dmabuf = attach->importer_priv;
+ struct kvm_memory_slot *slot = kv_dmabuf->slot;
+ struct kvm *kvm = kv_dmabuf->kvm;
+ bool flush = false;
+
+ struct kvm_gfn_range gfn_range = {
+ .start = slot->base_gfn,
+ .end = slot->base_gfn + slot->npages,
+ .slot = slot,
+ .may_block = true,
+ .attr_filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED,
+ };
+
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_begin(kvm);
+ flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ kvm_mmu_invalidate_end(kvm);
+ KVM_MMU_UNLOCK(kvm);
+}
+
+static const struct dma_buf_attach_ops kv_dmabuf_attach_ops = {
+ .allow_peer2peer = true,
+ .move_notify = kv_dmabuf_move_notify,
+};
+
+int kvm_vfio_dmabuf_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+ unsigned int fd)
+{
+ size_t size = slot->npages << PAGE_SHIFT;
+ struct dma_buf_attachment *attach;
+ struct kvm_vfio_dmabuf *kv_dmabuf;
+ struct dma_buf *dmabuf;
+ int ret;
+
+ dmabuf = dma_buf_get(fd);
+ if (IS_ERR(dmabuf))
+ return PTR_ERR(dmabuf);
+
+ if (size != dmabuf->size) {
+ ret = -EINVAL;
+ goto err_dmabuf;
+ }
+
+ kv_dmabuf = kzalloc(sizeof(*kv_dmabuf), GFP_KERNEL);
+ if (!kv_dmabuf) {
+ ret = -ENOMEM;
+ goto err_dmabuf;
+ }
+
+ kv_dmabuf->kvm = kvm;
+ kv_dmabuf->slot = slot;
+ attach = dma_buf_dynamic_attach(dmabuf, NULL, &kv_dmabuf_attach_ops,
+ kv_dmabuf);
+ if (IS_ERR(attach)) {
+ ret = PTR_ERR(attach);
+ goto err_kv_dmabuf;
+ }
+
+ slot->dmabuf_attach = attach;
+
+ return 0;
+
+err_kv_dmabuf:
+ kfree(kv_dmabuf);
+err_dmabuf:
+ dma_buf_put(dmabuf);
+ return ret;
+}
+
+void kvm_vfio_dmabuf_unbind(struct kvm_memory_slot *slot)
+{
+ struct dma_buf_attachment *attach = slot->dmabuf_attach;
+ struct kvm_vfio_dmabuf *kv_dmabuf;
+ struct dma_buf *dmabuf;
+
+ if (WARN_ON_ONCE(!attach))
+ return;
+
+ kv_dmabuf = attach->importer_priv;
+ dmabuf = attach->dmabuf;
+ dma_buf_detach(dmabuf, attach);
+ kfree(kv_dmabuf);
+ dma_buf_put(dmabuf);
+}
+
+/*
+ * The return value matters. If -EFAULT is returned, userspace will try to do
+ * page attribute (shared <-> private) conversion.
+ */
+int kvm_vfio_dmabuf_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+ struct dma_buf_attachment *attach = slot->dmabuf_attach;
+ pgoff_t pgoff = gfn - slot->base_gfn;
+ int ret;
+
+ if (WARN_ON_ONCE(!attach))
+ return -EFAULT;
+
+ ret = dma_buf_get_pfn_unlocked(attach, pgoff, pfn, max_order);
+ if (ret)
+ return -EIO;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_vfio_dmabuf_get_pfn);
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 07/12] KVM: x86/mmu: Handle page fault for vfio_dmabuf backed MMIO
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (5 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 06/12] KVM: Support vfio_dmabuf backed MMIO region Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device Xu Yilun
` (5 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Add support for resolving page faults on vfio_dmabuf backed MMIO. This
is to support private MMIO for privately assigned devices (known as TDI
in the TDISP spec).
Private MMIO is set to KVM as a vfio_dmabuf typed memory slot, which is
another type of can-be-private memory slot, just like the gmem slot.
Like the gmem slot, KVM needs to map its GFNs as shared or private based
on the current state of the GFN's memory attribute. When a page fault
happens for private MMIO but a private <-> shared conversion is needed,
KVM still exits to userspace with exit reason KVM_EXIT_MEMORY_FAULT and
toggles KVM_MEMORY_EXIT_FLAG_PRIVATE. Unlike the gmem slot, the
vfio_dmabuf slot has only one backend MMIO resource; switching the
GFN's attribute won't change the way of getting the PFN, which is
always the vfio_dmabuf specific way, kvm_vfio_dmabuf_get_pfn().
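A hypothetical VMM-side sketch of handling the conversion exit (gpa and
size taken from the exit info):

        struct kvm_memory_attributes attrs = {
                .address = run->memory_fault.gpa,
                .size = run->memory_fault.size,
                .attributes = (run->memory_fault.flags &
                               KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                              KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
        };

        ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);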
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
arch/x86/kvm/mmu/mmu.c | 25 +++++++++++++++++++++++--
include/linux/kvm_host.h | 7 ++++++-
2 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 713ca857f2c2..90ca54fee22f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4341,8 +4341,13 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
return -EFAULT;
}
- r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
- &fault->refcounted_page, &max_order);
+ if (kvm_slot_is_vfio_dmabuf(fault->slot))
+ r = kvm_vfio_dmabuf_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
+ &fault->pfn, &max_order);
+ else
+ r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
+ &fault->pfn, &fault->refcounted_page,
+ &max_order);
if (r) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return r;
@@ -4363,6 +4368,22 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
if (fault->is_private)
return kvm_mmu_faultin_pfn_private(vcpu, fault);
+ /* vfio_dmabuf slot is also applicable for shared mapping */
+ if (kvm_slot_is_vfio_dmabuf(fault->slot)) {
+ int max_order, r;
+
+ r = kvm_vfio_dmabuf_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
+ &fault->pfn, &max_order);
+ if (r)
+ return r;
+
+ fault->max_level = min(kvm_max_level_for_order(max_order),
+ fault->max_level);
+ fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+
+ return RET_PF_CONTINUE;
+ }
+
foll |= FOLL_NOWAIT;
fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll,
&fault->map_writable, &fault->refcounted_page);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 871d927485a5..966a5a247c6b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,7 +614,12 @@ struct kvm_memory_slot {
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
{
- return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
+ return slot && (slot->flags & (KVM_MEM_GUEST_MEMFD | KVM_MEM_VFIO_DMABUF));
+}
+
+static inline bool kvm_slot_is_vfio_dmabuf(const struct kvm_memory_slot *slot)
+{
+ return slot && (slot->flags & KVM_MEM_VFIO_DMABUF);
}
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (6 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 07/12] KVM: x86/mmu: Handle page fault for vfio_dmabuf backed MMIO Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-08 13:30 ` Jason Gunthorpe
2025-01-07 14:27 ` [RFC PATCH 09/12] vfio/pci: Export vfio dma-buf specific info for importers Xu Yilun
` (4 subsequent siblings)
12 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
intended for private assignment. For these privately assigned devices,
disallow host access to their MMIO resources.
Since the MMIO regions for private assignment are not accessible from
the host, remove VFIO_REGION_INFO_FLAG_MMAP/READ/WRITE for these
regions; instead, add a new VFIO_REGION_INFO_FLAG_PRIVATE flag to
indicate that users should create a dma-buf for MMIO mapping in the
KVM MMU.
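A hypothetical userspace sketch of the uAPI added below (iommufd_fd and
device_fd are placeholders, error handling elided):

        struct vfio_device_bind_iommufd bind = {
                .argsz = sizeof(bind),
                .flags = VFIO_DEVICE_BIND_IOMMUFD_PRIVATE,
                .iommufd = iommufd_fd,
        };

        ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);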
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
drivers/vfio/device_cdev.c | 9 ++++++++-
drivers/vfio/pci/vfio_pci_core.c | 14 ++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 2 ++
drivers/vfio/pci/vfio_pci_rdwr.c | 3 +++
include/linux/vfio.h | 1 +
include/uapi/linux/vfio.h | 5 ++++-
6 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index bb1817bd4ff3..919285c1cd7a 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -75,7 +75,10 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
if (copy_from_user(&bind, arg, minsz))
return -EFAULT;
- if (bind.argsz < minsz || bind.flags || bind.iommufd < 0)
+ if (bind.argsz < minsz || bind.iommufd < 0)
+ return -EINVAL;
+
+ if (bind.flags & ~(VFIO_DEVICE_BIND_IOMMUFD_PRIVATE))
return -EINVAL;
/* BIND_IOMMUFD only allowed for cdev fds */
@@ -118,6 +121,9 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
goto out_close_device;
device->cdev_opened = true;
+ if (bind.flags & VFIO_DEVICE_BIND_IOMMUFD_PRIVATE)
+ device->is_private = true;
+
/*
* Paired with smp_load_acquire() in vfio_device_fops::ioctl/
* read/write/mmap
@@ -151,6 +157,7 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
return;
mutex_lock(&device->dev_set->lock);
+ device->is_private = false;
vfio_df_close(df);
vfio_device_put_kvm(device);
iommufd_ctx_put(df->iommufd);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f69eda5956ad..11c735dfe1f7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1005,6 +1005,12 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
}
+bool is_vfio_pci_bar_private(struct vfio_pci_core_device *vdev, int bar)
+{
+ /* Any mmap supported bar can be used as vfio dmabuf */
+ return vdev->bar_mmap_supported[bar] && vdev->vdev.is_private;
+}
+
static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
struct vfio_region_info __user *arg)
{
@@ -1035,6 +1041,11 @@ static int vfio_pci_ioctl_get_region_info(struct vfio_pci_core_device *vdev,
break;
}
+ if (is_vfio_pci_bar_private(vdev, info.index)) {
+ info.flags = VFIO_REGION_INFO_FLAG_PRIVATE;
+ break;
+ }
+
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
if (vdev->bar_mmap_supported[info.index]) {
@@ -1735,6 +1746,9 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
u64 phys_len, req_len, pgoff, req_start;
int ret;
+ if (vdev->vdev.is_private)
+ return -EINVAL;
+
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d27f383f3931..2b61e35145fd 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -126,4 +126,6 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
}
#endif
+bool is_vfio_pci_bar_private(struct vfio_pci_core_device *vdev, int bar);
+
#endif
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 66b72c289284..e385f7f63414 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -242,6 +242,9 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
struct resource *res = &vdev->pdev->resource[bar];
ssize_t done;
+ if (is_vfio_pci_bar_private(vdev, bar))
+ return -EINVAL;
+
if (pci_resource_start(pdev, bar))
end = pci_resource_len(pdev, bar);
else if (bar == PCI_ROM_RESOURCE &&
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 2258b0585330..e99d856c6cd8 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -69,6 +69,7 @@ struct vfio_device {
struct iommufd_device *iommufd_device;
u8 iommufd_attached:1;
#endif
+ u8 is_private:1;
u8 cdev_opened:1;
#ifdef CONFIG_DEBUG_FS
/*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index f43dfbde7352..6a1c703e3185 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -275,6 +275,7 @@ struct vfio_region_info {
#define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP (1 << 2) /* Region supports mmap */
#define VFIO_REGION_INFO_FLAG_CAPS (1 << 3) /* Info supports caps */
+#define VFIO_REGION_INFO_FLAG_PRIVATE (1 << 4) /* Region supports private MMIO */
__u32 index; /* Region index */
__u32 cap_offset; /* Offset within info struct of first cap */
__aligned_u64 size; /* Region size (bytes) */
@@ -904,7 +905,8 @@ struct vfio_device_feature {
* VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 18,
* struct vfio_device_bind_iommufd)
* @argsz: User filled size of this data.
- * @flags: Must be 0.
+ * @flags: Optional device initialization flags:
+ * VFIO_DEVICE_BIND_IOMMUFD_PRIVATE: for private assignment
* @iommufd: iommufd to bind.
* @out_devid: The device id generated by this bind. devid is a handle for
* this device/iommufd bond and can be used in IOMMUFD commands.
@@ -921,6 +923,7 @@ struct vfio_device_feature {
struct vfio_device_bind_iommufd {
__u32 argsz;
__u32 flags;
+#define VFIO_DEVICE_BIND_IOMMUFD_PRIVATE (1 << 0)
__s32 iommufd;
__u32 out_devid;
};
--
2.25.1
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-07 14:27 ` [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device Xu Yilun
@ 2025-01-08 13:30 ` Jason Gunthorpe
2025-01-08 16:57 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-08 13:30 UTC (permalink / raw)
To: Xu Yilun
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> for private assignment. For these private assigned devices, disallow
> host accessing their MMIO resources.
Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
to enforce this?
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-08 13:30 ` Jason Gunthorpe
@ 2025-01-08 16:57 ` Xu Yilun
2025-01-09 14:40 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-01-08 16:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > for private assignment. For these private assigned devices, disallow
> > host accessing their MMIO resources.
>
> Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> to enforce this?
MM.. maybe I should not say 'host', but rather 'userspace'.
I think the kernel part of the VMM (KVM) has the responsibility to enforce the
correct behavior of the userspace part of the VMM (QEMU). QEMU has no way to
touch private memory/MMIO intentionally or accidentally. IIUC that's one
of the motivations guest_memfd was introduced for private memory. Private
MMIO follows.
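As an illustration, a minimal userspace sketch of the resulting flow, using
the uAPI added in this series (needs <sys/ioctl.h> and <linux/vfio.h>;
hypothetical fds, error handling mostly omitted):

	int bind_private(int device_fd, int iommufd)
	{
		struct vfio_device_bind_iommufd bind = {
			.argsz = sizeof(bind),
			.flags = VFIO_DEVICE_BIND_IOMMUFD_PRIVATE,
			.iommufd = iommufd,	/* fd from /dev/iommu */
		};
		struct vfio_region_info info = {
			.argsz = sizeof(info),
			.index = VFIO_PCI_BAR0_REGION_INDEX,
		};

		/* mark the device for private assignment at bind time */
		if (ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind) < 0)
			return -1;

		/*
		 * The BAR now reports VFIO_REGION_INFO_FLAG_PRIVATE only,
		 * and read()/write()/mmap() of it fail with -EINVAL.
		 */
		return ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);
	}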
Thanks,
Yilun
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-08 16:57 ` Xu Yilun
@ 2025-01-09 14:40 ` Jason Gunthorpe
2025-01-09 16:40 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-09 14:40 UTC (permalink / raw)
To: Xu Yilun
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote:
> On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > > for private assignment. For these private assigned devices, disallow
> > > host accessing their MMIO resources.
> >
> > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> > to enforce this?
>
> MM.. maybe I should not say 'host', but rather 'userspace'.
>
> I think the kernel part of the VMM (KVM) has the responsibility to enforce the
> correct behavior of the userspace part of the VMM (QEMU). QEMU has no way to
> touch private memory/MMIO intentionally or accidentally. IIUC that's one
> of the motivations guest_memfd was introduced for private memory. Private
> MMIO follows.
Okay, but then why is it a flag like that? I'm expecting a much
broader system here to make the VFIO device into a confidential device
(like setting up the TDI) where we'd have to enforce the private things,
communicate with some secure world to assign it, and so on.
I want to see a fuller solution to the CC problem in VFIO before we
can be sure what is the correct UAPI. In other words, making the
VFIO device into a CC device should also prevent mmaping it and so on.
So, I would take this out and defer VFIO enforcement to a series which
does fuller CC enablement of VFIO.
The precursor work should just be avoiding requiring a VMA when
installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by
using a FD to get the CPU pfns into iommufd and kvm as you are
showing.
This works just fine for non-CC devices anyhow and is the necessary
building block for making a TDI interface in VFIO.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-09 14:40 ` Jason Gunthorpe
@ 2025-01-09 16:40 ` Xu Yilun
2025-01-10 13:31 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-01-09 16:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Thu, Jan 09, 2025 at 10:40:51AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 09, 2025 at 12:57:58AM +0800, Xu Yilun wrote:
> > On Wed, Jan 08, 2025 at 09:30:26AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 07, 2025 at 10:27:15PM +0800, Xu Yilun wrote:
> > > > Add a flag for ioctl(VFIO_DEVICE_BIND_IOMMUFD) to mark a device as
> > > > for private assignment. For these private assigned devices, disallow
> > > > host accessing their MMIO resources.
> > >
> > > Why? Shouldn't the VMM simply not call mmap? Why does the kernel have
> > > to enforce this?
> >
> > MM.. maybe I should not say 'host', but rather 'userspace'.
> >
> > I think the kernel part of the VMM (KVM) has the responsibility to enforce the
> > correct behavior of the userspace part of the VMM (QEMU). QEMU has no way to
> > touch private memory/MMIO intentionally or accidentally. IIUC that's one
> > of the motivations guest_memfd was introduced for private memory. Private
> > MMIO follows.
>
> Okay, but then why is it a flag like that? I'm expecting a much
This flag is a prerequisite for setting up TDI, or part of the
requirement to make a "TDI capable" assigned device. It prevents the
userspace mapping in the first place, even while the device is still shared.
We want the device to first appear as a shared device in the CoCo-VM, then
do TDI setup (via a tsm verb "bind"). This late bind approach avoids
changing the CoCo VM startup routine. In contrast, early bind would
easily be broken, especially if the BIOS is not aware of the TDI rules.
So then we are faced with the shared <-> private device conversion in the CoCo VM,
and in turn the shared <-> private MMIO conversion. An MMIO region has only one
physical backend, so it is a bit like an in-place conversion, which is
complicated. I want to simplify the MMIO conversion routine based on the fact
that the VMM never needs to access assigned MMIO for feature emulation, so
always disallow userspace MMIO mapping during the whole lifecycle. That's
why the flag is introduced.
Patch 6 has a similar description.
> broader system here to make the VFIO device into a confidential device
> (like setting up the TDI) where we'd have to enforce the private things,
I plan to introduce a new VFIO ioctl to set up the TDI.
> communicate with some secure world to assign it, and so on.
Yes, the new VFIO ioctl will communicate with PCI TSM.
>
> I want to see a fuller solution to the CC problem in VFIO before we
MM.. I have something but it needs more preparation. Whether to send it
out or make a public repo, I'll discuss internally.
> can be sure what is the correct UAPI. In other words, making the
> VFIO device into a CC device should also prevent mmaping it and so on.
My idea is to prevent mmaping first, then allow the VFIO device to become a CC dev (TDI).
>
> So, I would take this out and defer VFIO enforcement to a series which
> does fuller CC enablement of VFIO.
>
> The precursor work should just be avoiding requiring a VMA when
> installing VFIO MMIO into the KVM and IOMMU stage 2 mappings. Ie by
> using a FD to get the CPU pfns into iommufd and kvm as you are
> showing.
>
> This works just fine for non-CC devices anyhow and is the necessary
Yes. It carries out the idea of "KVM maps MMIO resources without first
mapping them into the host" even for a normal VM. That's why I think it could
be an independent patchset.
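A rough sketch of the importer side (the dma_buf_get_pfn_unlocked()
signature is assumed from this RFC; kvm_install_private_pfn() is a made-up
placeholder for the KVM MMU update):

	/*
	 * Sketch only: KVM resolves a pfn straight from the attached
	 * dma-buf and installs it into its stage-2 table; no userspace
	 * VMA is involved at any point.
	 */
	static int kvm_map_mmio_from_dmabuf(struct kvm *kvm, gfn_t gfn,
					    struct dma_buf_attachment *attach,
					    pgoff_t pgoff)
	{
		u64 pfn;
		int ret;

		/* signature assumed from this RFC */
		ret = dma_buf_get_pfn_unlocked(attach, pgoff, &pfn);
		if (ret)
			return ret;

		/* made-up placeholder for the stage-2 update */
		return kvm_install_private_pfn(kvm, gfn, pfn);
	}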
Thanks,
Yilun
> building block for making a TDI interface in VFIO.
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-09 16:40 ` Xu Yilun
@ 2025-01-10 13:31 ` Jason Gunthorpe
2025-01-11 3:48 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 13:31 UTC (permalink / raw)
To: Xu Yilun
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, Jan 10, 2025 at 12:40:28AM +0800, Xu Yilun wrote:
> So then we are faced with the shared <-> private device conversion in the CoCo VM,
> and in turn the shared <-> private MMIO conversion. An MMIO region has only one
> physical backend, so it is a bit like an in-place conversion, which is
> complicated. I want to simplify the MMIO conversion routine based on the fact
> that the VMM never needs to access assigned MMIO for feature emulation, so
> always disallow userspace MMIO mapping during the whole lifecycle. That's
> why the flag is introduced.
The VMM can simply not map it for these cases. As part of the TDI
flow the kernel can validate that it is not mapped.
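A sketch of what that validation could look like (hypothetical helper and
counter; vfio-pci tracks no such thing today):

	/* Hypothetical: fail the TDI bind while userspace mappings exist. */
	static int vfio_pci_validate_unmapped(struct vfio_pci_core_device *vdev)
	{
		lockdep_assert_held(&vdev->memory_lock);

		if (atomic_read(&vdev->num_bar_mmaps))	/* invented counter */
			return -EBUSY;

		return 0;
	}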
> > can be sure what is the correct UAPI. In other words, making the
> > VFIO device into a CC device should also prevent mmaping it and so on.
>
> My idea is to prevent mmaping first, then allow the VFIO device to become a CC dev (TDI).
I think you need to start the TDI process much earlier. Some arches
are going to need work to prepare the TDI before the VM is started.
The other issue here is that Intel is somewhat different from others
and when we build uapi for TDI it has to accommodate everyone.
> Yes. It carries out the idea of "KVM maps MMIO resources without first
> mapping them into the host" even for a normal VM. That's why I think it could
> be an independent patchset.
Yes, just remove this patch and other TDI focused stuff. Just
infrastructure to move to FD based mapping instead of VMA.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-10 13:31 ` Jason Gunthorpe
@ 2025-01-11 3:48 ` Xu Yilun
2025-01-13 16:49 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-01-11 3:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, Jan 10, 2025 at 09:31:16AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 12:40:28AM +0800, Xu Yilun wrote:
>
> > So then we are faced with the shared <-> private device conversion in the CoCo VM,
> > and in turn the shared <-> private MMIO conversion. An MMIO region has only one
> > physical backend, so it is a bit like an in-place conversion, which is
> > complicated. I want to simplify the MMIO conversion routine based on the fact
> > that the VMM never needs to access assigned MMIO for feature emulation, so
> > always disallow userspace MMIO mapping during the whole lifecycle. That's
> > why the flag is introduced.
>
> The VMM can simply not map it for these cases. As part of the TDI
> flow the kernel can validate that it is not mapped.
That's a good point. I can give that a try.
>
> > > can be sure what is the correct UAPI. In other words, making the
> > > VFIO device into a CC device should also prevent mmaping it and so on.
> >
> > My idea is to prevent mmaping first, then allow the VFIO device to become a CC dev (TDI).
>
> I think you need to start the TDI process much earlier. Some arches
> are going to need work to prepare the TDI before the VM is started.
Could you elaborate more on that? AFAICS Intel & AMD are all good on
"late bind", but not sure for other architectures. This relates to the
definition of the TSM verbs, and now is the right time to collect the
needs for Dan's series.
>
> The other issue here is that Intel is somewhat different from others
> and when we build uapi for TDI it has to accommodate everyone.
Sure, this is the aim of the PCI TSM core, and VFIO as a PCI TSM user
should not be TDX-aware.
>
> > Yes. It carries out the idea of "KVM maps MMIO resources without first
> > mapping them into the host" even for a normal VM. That's why I think it could
> > be an independent patchset.
>
> Yes, just remove this patch and other TDI focused stuff. Just
> infrastructure to move to FD based mapping instead of VMA.
Yes.
Thanks,
Yilun
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-11 3:48 ` Xu Yilun
@ 2025-01-13 16:49 ` Jason Gunthorpe
2024-06-17 23:28 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 16:49 UTC (permalink / raw)
To: Xu Yilun
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Sat, Jan 11, 2025 at 11:48:06AM +0800, Xu Yilun wrote:
> > > > can be sure what is the correct UAPI. In other words, making the
> > > > VFIO device into a CC device should also prevent mmaping it and so on.
> > >
> > > My idea is to prevent mmaping first, then allow the VFIO device to become a CC dev (TDI).
> >
> > I think you need to start the TDI process much earlier. Some arches
> > are going to need work to prepare the TDI before the VM is started.
>
> Could you elaborate more on that? AFAICS Intel & AMD are all good on
> "late bind", but not sure for other architectures.
I'm not sure about this, the topic has been confused a bit, and people
often seem to misunderstand what the full scenario actually is. :\
What I'm talking about there is that you will tell the secure world to
create a vPCI function that has the potential to be secure "TDI run"
down the road. The VM will decide when it reaches the run state. This
is needed so the secure world can prepare anything it needs prior to
starting the VM. Setting up secure vIOMMU emulation, for instance. I
expect ARM will need this, I'd be surprised if AMD actually doesn't in
the full scenario with secure viommu.
It should not be a surprise to the secure world after the VM has
started that suddenly it learns about a vPCI function that wants to be
secure. This should all be pre-arranged as much as possible before starting
the VM, even if a lot of steps happen after the VM starts running (or
maybe don't happen at all).
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-13 16:49 ` Jason Gunthorpe
@ 2024-06-17 23:28 ` Xu Yilun
2025-01-14 13:35 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2024-06-17 23:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Mon, Jan 13, 2025 at 12:49:35PM -0400, Jason Gunthorpe wrote:
> On Sat, Jan 11, 2025 at 11:48:06AM +0800, Xu Yilun wrote:
>
> > > > > can be sure what is the correct UAPI. In other words, making the
> > > > > VFIO device into a CC device should also prevent mmaping it and so on.
> > > >
> > > > My idea is to prevent mmaping first, then allow the VFIO device to become a CC dev (TDI).
> > >
> > > I think you need to start the TDI process much earlier. Some arches
> > > are going to need work to prepare the TDI before the VM is started.
> >
> > Could you elaborate more on that? AFAICS Intel & AMD are all good on
> > "late bind", but not sure for other architectures.
>
> I'm not sure about this, the topic has been confused a bit, and people
> often seem to misunderstand what the full scenario actually is. :\
Yes, it is at an early stage and open to discussion.
>
> What I'm talking about there is that you will tell the secure world to
> create a vPCI function that has the potential to be secure "TDI run"
> down the road. The VM will decide when it reaches the run state. This
Yes.
> is needed so the secure world can prepare anything it needs prior to
> starting the VM.
OK. From Dan's patchset there are some touch points for vendor tsm
drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
Maybe we could move to Dan's thread for discussion.
https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
> Setting up secure vIOMMU emulation, for instance. I
I think this could be done at VM late bind time.
> expect ARM will need this, I'd be surprised if AMD actually doesn't in
> the full scenario with secure viommu.
AFAICS, AMD needs secure viommu.
>
> It should not be a surprise to the secure world after the VM has
> started that suddenly it learns about a vPCI function that wants to be
With some pre-VM stage touch points, it wouldn't be all of a sudden.
> secure. This should all be pre-arranged as much as possible before starting
But our current implementation is not to prepare as much as possible,
but only what is necessary, so most of the secure work for the vPCI
function is done at late bind time.
Thanks,
Yilun
> the VM, even if a lot of steps happen after the VM starts running (or
> maybe don't happen at all).
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2024-06-17 23:28 ` Xu Yilun
@ 2025-01-14 13:35 ` Jason Gunthorpe
2025-01-15 12:57 ` Alexey Kardashevskiy
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-14 13:35 UTC (permalink / raw)
To: Xu Yilun
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, aik, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > is needed so the secure world can prepare anything it needs prior to
> > starting the VM.
>
> OK. From Dan's patchset there are some touch points for vendor tsm
> drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
>
> Maybe we could move to Dan's thread for discussion.
>
> https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
I think Dan's series is different, any uapi from that series should
not be used in the VMM case. We need proper vfio APIs for the VMM to
use. I would expect VFIO to be calling some of that infrastructure.
Really, I don't see a clear sense of how this will look yet. AMD
provided some patches along these lines, but I have not seen ARM and Intel
proposals yet, nor do I sense there is alignment.
> > Setting up secure vIOMMU emulation, for instance. I
>
> I think this could be done at VM late bind time.
The vIOMMU needs to be set up before the VM boots
> > secure. This should all be pre-arranged as much as possible before starting
>
> But our current implementation is not to prepare as much as possible,
> but only what is necessary, so most of the secure work for the vPCI
> function is done at late bind time.
That's fine too, but both options need to be valid.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-14 13:35 ` Jason Gunthorpe
@ 2025-01-15 12:57 ` Alexey Kardashevskiy
2025-01-15 13:01 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-15 12:57 UTC (permalink / raw)
To: Jason Gunthorpe, Xu Yilun
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 15/1/25 00:35, Jason Gunthorpe wrote:
> On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
>
>>> is needed so the secure world can prepare anything it needs prior to
>>> starting the VM.
>>
>> OK. From Dan's patchset there are some touch points for vendor tsm
>> drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
>>
>> Maybe we could move to Dan's thread for discussion.
>>
>> https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
>
> I think Dan's series is different, any uapi from that series should
> not be used in the VMM case. We need proper vfio APIs for the VMM to
> use. I would expect VFIO to be calling some of that infrastructure.
Something like this experiment?
https://github.com/aik/linux/commit/ce052512fb8784e19745d4cb222e23cabc57792e
Thanks,
>
> Really, I don't see a clear sense of how this will look yet. AMD
> provided some patches along these lines, but I have not seen ARM and Intel
> proposals yet, nor do I sense there is alignment.
>
>>> Setting up secure vIOMMU emulation, for instance. I
>>
>> I think this could be done at VM late bind time.
>
> The vIOMMU needs to be set up before the VM boots
>
>>> secure. This should all be pre-arranged as possible before starting
>>
>> But our current implementation is not to prepare as much as possible,
>> but only necessary, so most of the secure work for vPCI function is done
>> at late bind time.
>
> That's fine too, but both options need to be valid.
>
> Jason
--
Alexey
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-15 12:57 ` Alexey Kardashevskiy
@ 2025-01-15 13:01 ` Jason Gunthorpe
2025-01-17 1:57 ` Baolu Lu
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-15 13:01 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> On 15/1/25 00:35, Jason Gunthorpe wrote:
> > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> >
> > > > is needed so the secure world can prepare anything it needs prior to
> > > > starting the VM.
> > >
> > > OK. From Dan's patchset there are some touch points for vendor tsm
> > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > >
> > > Maybe we could move to Dan's thread for discussion.
> > >
> > > https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
> >
> > I think Dan's series is different, any uapi from that series should
> > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > use. I would expect VFIO to be calling some of that infrastructure.
>
> Something like this experiment?
>
> https://github.com/aik/linux/commit/ce052512fb8784e19745d4cb222e23cabc57792e
Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
hosting those APIs, the above does seem to be a reasonable direction.
When the various fds are closed I would expect the kernel to unbind
and restore the device back.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-15 13:01 ` Jason Gunthorpe
@ 2025-01-17 1:57 ` Baolu Lu
2025-01-17 13:25 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Baolu Lu @ 2025-01-17 1:57 UTC (permalink / raw)
To: Jason Gunthorpe, Alexey Kardashevskiy
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On 1/15/25 21:01, Jason Gunthorpe wrote:
> On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
>> On 15/1/25 00:35, Jason Gunthorpe wrote:
>>> On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
>>>
>>>>> is needed so the secure world can prepare anything it needs prior to
>>>>> starting the VM.
>>>> OK. From Dan's patchset there are some touch points for vendor tsm
>>>> drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
>>>>
>>>> Maybe we could move to Dan's thread for discussion.
>>>>
>>>> https://lore.kernel.org/linux-
>>>> coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-
>>>> xfh.jf.intel.com/
>>> I think Dan's series is different, any uapi from that series should
>>> not be used in the VMM case. We need proper vfio APIs for the VMM to
>>> use. I would expect VFIO to be calling some of that infrastructure.
>> Something like this experiment?
>>
>> https://github.com/aik/linux/commit/
>> ce052512fb8784e19745d4cb222e23cabc57792e
> Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
> hosting those APIs, the above does seem to be a reasonable direction.
>
> When the various fds are closed I would expect the kernel to unbind
> and restore the device back.
I am curious about the value of tsm binding against an iommufd_vdevice
instead of the physical iommufd_device.
It is likely that the kvm pointer should be passed to iommufd during the
creation of a viommu object. If my recollection is correct, the arm
smmu-v3 needs it to obtain the vmid to set up the userspace event queue:
struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
				       struct iommu_domain *parent,
				       struct iommufd_ctx *ictx,
				       unsigned int viommu_type)
{
	[...]
	/* FIXME Move VMID allocation from the S2 domain allocation to here */
	vsmmu->vmid = s2_parent->s2_cfg.vmid;

	return &vsmmu->core;
}
Intel TDX connect implementation also needs a reference to the kvm
pointer to obtain the secure EPT information. This is crucial because
the CPU's page table must be shared with the iommu. I am not sure
whether the AMD architecture has a similar requirement.
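A hypothetical shape for that (not an existing iommufd interface), taking
the kvm pointer once at vIOMMU creation:

	/*
	 * Hypothetical only: hand the KVM context to iommufd when the
	 * vIOMMU object is created, so the driver can derive per-VM
	 * state (the SMMUv3 VMID, TDX secure EPT info, ...) up front.
	 */
	struct iommufd_viommu *viommu_alloc(struct device *dev,
					    struct iommu_domain *parent,
					    struct iommufd_ctx *ictx,
					    struct kvm *kvm,	/* new */
					    unsigned int viommu_type);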
---
baolu
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-17 1:57 ` Baolu Lu
@ 2025-01-17 13:25 ` Jason Gunthorpe
2024-06-23 19:59 ` Xu Yilun
` (2 more replies)
0 siblings, 3 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-17 13:25 UTC (permalink / raw)
To: Baolu Lu
Cc: Alexey Kardashevskiy, Xu Yilun, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Fri, Jan 17, 2025 at 09:57:40AM +0800, Baolu Lu wrote:
> On 1/15/25 21:01, Jason Gunthorpe wrote:
> > On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> > > On 15/1/25 00:35, Jason Gunthorpe wrote:
> > > > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > > >
> > > > > > is needed so the secure world can prepare anything it needs prior to
> > > > > > starting the VM.
> > > > > OK. From Dan's patchset there are some touch points for vendor tsm
> > > > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > > > >
> > > > > Maybe we could move to Dan's thread for discussion.
> > > > >
> > > > > https://lore.kernel.org/linux-
> > > > > coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-
> > > > > xfh.jf.intel.com/
> > > > I think Dan's series is different, any uapi from that series should
> > > > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > > > use. I would expect VFIO to be calling some of that infrastructure.
> > > Something like this experiment?
> > >
> > > https://github.com/aik/linux/commit/
> > > ce052512fb8784e19745d4cb222e23cabc57792e
> > Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
> > hosting those APIs, the above does seem to be a reasonable direction.
> >
> > When the various fds are closed I would expect the kernel to unbind
> > and restore the device back.
>
> I am curious about the value of tsm binding against an iommufd_vdevice
> instead of the physical iommufd_device.
Interesting question
> It is likely that the kvm pointer should be passed to iommufd during the
> creation of a viommu object.
Yes, I fully expect this
> If my recollection is correct, the arm
> smmu-v3 needs it to obtain the vmid to set up the userspace event queue:
Right now it will use a VMID unrelated to KVM. BTM support on ARM will
require syncing the VMID with KVM.
AMD and Intel may require the KVM for some reason as well.
For CC I'm expecting the KVM fd to be the handle for the cVM, so any
RPCs that want to call into the secure world need the KVM FD to get
the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
information and the cVM's handle.
From that perspective it does make sense that any cVM related APIs,
like "bind to cVM" would be against the VDEVICE where we have a link
to the VIOMMU which has the KVM. On the iommufd side the VIOMMU is
part of the object hierarchy, but does not necessarily have to force a
vIOMMU to appear in the cVM.
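To make the object relationship concrete, a hypothetical uAPI shape (all
names invented):

	/*
	 * Hypothetical only: bind against the vDEVICE; the vDEVICE already
	 * links to the vIOMMU, which holds the KVM/cVM handle, so the
	 * ioctl itself carries no KVM fd.
	 */
	struct iommu_vdevice_tsm_bind {
		__u32 size;
		__u32 flags;
		__u32 vdevice_id;	/* iommufd vDEVICE object ID */
		__u32 __reserved;
	};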
But it also seems to me that VFIO should be able to support putting
the device into the RUN state without involving KVM or cVMs.
> Intel TDX connect implementation also needs a reference to the kvm
> pointer to obtain the secure EPT information. This is crucial because
> the CPU's page table must be shared with the iommu.
I thought kvm folks were NAKing this sharing entirely? Or is the
secure EPT in the secure world and not directly managed by Linux?
AFAIK AMD is going to mirror the iommu page table like today.
ARM, I suspect, will not have an "EPT" under Linux control, so
whatever happens will be hidden in their secure world.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-17 13:25 ` Jason Gunthorpe
@ 2024-06-23 19:59 ` Xu Yilun
2025-01-20 13:25 ` Jason Gunthorpe
2025-01-20 4:41 ` Baolu Lu
2025-01-20 9:45 ` Alexey Kardashevskiy
2 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2024-06-23 19:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Fri, Jan 17, 2025 at 09:25:23AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 17, 2025 at 09:57:40AM +0800, Baolu Lu wrote:
> > On 1/15/25 21:01, Jason Gunthorpe wrote:
> > > On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> > > > On 15/1/25 00:35, Jason Gunthorpe wrote:
> > > > > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > > > >
> > > > > > > is needed so the secure world can prepare anything it needs prior to
> > > > > > > starting the VM.
> > > > > > OK. From Dan's patchset there are some touch points for vendor tsm
> > > > > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > > > > >
> > > > > > Maybe we could move to Dan's thread for discussion.
> > > > > >
> > > > > > https://lore.kernel.org/linux-
> > > > > > coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-
> > > > > > xfh.jf.intel.com/
> > > > > I think Dan's series is different, any uapi from that series should
> > > > > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > > > > use. I would expect VFIO to be calling some of that infrastructure.
> > > > Something like this experiment?
> > > >
> > > > https://github.com/aik/linux/commit/
> > > > ce052512fb8784e19745d4cb222e23cabc57792e
> > > Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
> > > hosting those APIs, the above does seem to be a reasonable direction.
> > >
> > > When the various fds are closed I would expect the kernel to unbind
> > > and restore the device back.
> >
> > I am curious about the value of tsm binding against an iommufd_vdevice
> > instead of the physical iommufd_device.
>
> Interesting question
>
> > It is likely that the kvm pointer should be passed to iommufd during the
> > creation of a viommu object.
>
> Yes, I fully expect this
>
> > If my recollection is correct, the arm
> > smmu-v3 needs it to obtain the vmid to set up the userspace event queue:
>
> Right now it will use a VMID unrelated to KVM. BTM support on ARM will
> require syncing the VMID with KVM.
>
> AMD and Intel may require the KVM for some reason as well.
>
> For CC I'm expecting the KVM fd to be the handle for the cVM, so any
> RPCs that want to call into the secure world need the KVM FD to get
> the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
> information and the cVM's handle.
I also expect this.
>
> From that perspective it does make sense that any cVM related APIs,
> like "bind to cVM" would be against the VDEVICE where we have a link
> to the VIOMMU which has the KVM. On the iommufd side the VIOMMU is
> part of the object hierarchy, but does not necessarily have to force a
> vIOMMU to appear in the cVM.
>
> But it also seems to me that VFIO should be able to support putting
> the device into the RUN state
Firstly I think VFIO should support putting the device into the *LOCKED* state.
From LOCKED to RUN, there are many evidence fetching and attestation
things that only the guest cares about. I don't think VFIO needs to opt in.
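For reference, the TDI state machine being discussed, with states as
defined by the PCIe TDISP specification:

	enum tdisp_tdi_state {
		TDISP_CONFIG_UNLOCKED,	/* untrusted, host configures freely */
		TDISP_CONFIG_LOCKED,	/* config locked, awaiting attestation */
		TDISP_RUN,		/* attested, operating as trusted */
		TDISP_ERROR,		/* any violation lands here */
	};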
But that doesn't impact this concern. I actually think VFIO should
provide a 'bind' uAPI to support these device side configuration things
rather than an iommufd uAPI. IIUC iommufd should only do the setup on
the IOMMU side.
The switching of the TDISP state to LOCKED involves device side
differences that should be handled by the device owner, the VFIO driver.
E.g., as we previously mentioned, checking that no MMIO was ever mapped.
Another e.g.: invalidating MMIOs when the device is to be LOCKED; some
pseudo code:
@@ -1494,7 +1494,15 @@ static int vfio_pci_ioctl_tsm_bind(struct vfio_pci_core_device *vdev,
 	if (!kvm)
 		return -ENOENT;
+	down_write(&vdev->memory_lock);
+	vfio_pci_dma_buf_move(vdev, true);
+
 	ret = pci_tsm_dev_bind(pdev, kvm, &bind.intf_id);
+
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
+	up_write(&vdev->memory_lock);
BTW, we may still need viommu/vdevice APIs during 'bind', if some IOMMU
side configurations are required by the secure world. TDX does have some.
> without involving KVM or cVMs.
It may not be feasible for all vendors. I believe AMD would have one
firmware call that requires the cVM handle *AND* moves the device into the
LOCKED state. It really depends on the firmware implementation.
So I'm expecting a coarse TSM verb pci_tsm_dev_bind() for vendors to do
any host side preparation and put the device into the LOCKED state.
>
> > Intel TDX connect implementation also needs a reference to the kvm
> > pointer to obtain the secure EPT information. This is crucial because
> > the CPU's page table must be shared with the iommu.
>
> I thought kvm folks were NAKing this sharing entirely? Or is the
I believe this is still based on the general EPT sharing idea, isn't it?
There are several major reasons for the objection. In general, KVM now
has many "page non-present" tricks in EPT, which are not applicable to
IOPT. If shared, KVM has to take IOPT concerns into account, which is
quite a maintenance burden for KVM.
> secure EPT in the secure world and not directly managed by Linux?
Yes, the secure EPT is in the secure world and managed by TDX firmware.
Now a SW Mirror Secure EPT is introduced in KVM and managed by KVM
directly, and KVM will finally use firmware calls to propagate Mirror
Secure EPT changes to secure EPT.
The Secure EPT is controlled by the TDX module, so basically KVM cannot
play any of those tricks. And the TDX firmware should ensure that any SEPT
setting would be applicable to the Secure IOPT. I hope this could remove
most of the concerns.
I remember we've talked about the SEPT sharing architecture for TDX TIO
before, but didn't get information back from the KVM folks. Not sure how
things will go. Maybe we will find out when we have some patches posted.
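A conceptual sketch of that propagation (function names invented, not the
actual TDX module ABI):

	/*
	 * Sketch only: KVM edits its software mirror and propagates each
	 * change to the real Secure EPT via a TDX module call; Linux
	 * never touches the secure table directly.
	 */
	static int mirror_sept_set(struct kvm *kvm, gfn_t gfn, u64 entry)
	{
		mirror_sept_write(kvm, gfn, entry);	/* SW copy, KVM-owned */

		return tdx_propagate_sept(kvm, gfn, entry); /* invented name */
	}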
Thanks,
Yilun
>
> AFAIK AMD is going to mirror the iommu page table like today.
>
> ARM, I suspect, will not have an "EPT" under Linux control, so
> whatever happens will be hidden in their secure world.
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2024-06-23 19:59 ` Xu Yilun
@ 2025-01-20 13:25 ` Jason Gunthorpe
2024-06-24 21:12 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 13:25 UTC (permalink / raw)
To: Xu Yilun
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Mon, Jun 24, 2024 at 03:59:53AM +0800, Xu Yilun wrote:
> > But it also seems to me that VFIO should be able to support putting
> > the device into the RUN state
>
> Firstly I think VFIO should support putting the device into the *LOCKED* state.
> From LOCKED to RUN, there are many evidence fetching and attestation
> things that only the guest cares about. I don't think VFIO needs to opt in.
VFIO is not just about running VMs. If someone wants to run DPDK on
VFIO they should be able to get the device into a RUN state and work
with secure memory without requiring a KVM. Yes there are many steps
to this, but we should imagine how it can work.
> > without involving KVM or cVMs.
>
> It may not be feasible for all vendors.
It must be. A CC guest with an in-kernel driver can definitely get the
PCI device into RUN, so VFIO running in the guest should be able to as
well.
> I believe AMD would have one firmware call that requires the cVM handle
> *AND* moves the device into the LOCKED state. It really depends on the
> firmware implementation.
IMHO, you would not use the secure firmware if you are not using VMs.
> Yes, the secure EPT is in the secure world and managed by TDX firmware.
> Now a SW Mirror Secure EPT is introduced in KVM and managed by KVM
> directly, and KVM will finally use firmware calls to propagate Mirror
> Secure EPT changes to secure EPT.
If the secure world managed it then the secure world can have rules
that work with the IOMMU as well..
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-20 13:25 ` Jason Gunthorpe
@ 2024-06-24 21:12 ` Xu Yilun
2025-01-21 17:43 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2024-06-24 21:12 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Mon, Jan 20, 2025 at 09:25:25AM -0400, Jason Gunthorpe wrote:
> On Mon, Jun 24, 2024 at 03:59:53AM +0800, Xu Yilun wrote:
> > > But it also seems to me that VFIO should be able to support putting
> > > the device into the RUN state
> >
> > Firstly I think VFIO should support putting the device into the *LOCKED* state.
> > From LOCKED to RUN, there are many evidence fetching and attestation
> > things that only the guest cares about. I don't think VFIO needs to opt in.
>
> VFIO is not just about running VMs. If someone wants to run DPDK on
> VFIO they should be able to get the device into a RUN state and work
> with secure memory without requiring a KVM. Yes there are many steps
> to this, but we should imagine how it can work.
Interesting question. I've never thought about native TIO before.
And you are also thinking about VFIO usage in a CoCo-VM. So I believe
VFIO could be able to support putting the device into the RUN state,
but there is no need for a uAPI for that; it happens when VFIO works as
a TEE attester.
In different cases, VFIO plays different roles:
1. TEE helper, but it is itself outside the TEE.
2. TEE attester, it is within the TEE.
3. TEE user, it is within the TEE.
As a TEE helper, it works on an untrusted device and helps put the device
in the LOCKED state, waiting for attestation. For the VM use case, the VM
acts as the attester to do the attestation and move the device into the
trusted/RUN state (let's say 'accept'). The attestation and accept could be
direct talks between the attester and the device (maybe via a TSM sysfs
node), because from LOCKED -> RUN VFIO doesn't change its way of handling
the device, so there seems to be no need to introduce extra uAPIs and
complexity just for passing the talks. That's my expectation of VFIO's
responsibility as a TEE helper - serve until LOCKED, don't care about the
rest, UNLOCK rolls back everything.
I imagine on bare metal, if DPDK works as an attester (within the TEE) and
VFIO still as a TEE helper (out of the TEE), this model seems to still work.
When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
subsystem) has already moved the device to RUN state. So VFIO & DPDK
are all TEE users, no need to manipulate TDISP state between them.
AFAICS, this is the most preferred TIO usage in CoCo-VM.
When VFIO works as a TEE attester in the VM, it means the VM's PCI
subsystem leaves the attestation work to device drivers. VFIO should do
the attestation and accept before passing through to DPDK; again no need to
manipulate TDISP state between them.
I imagine the possibility that TIO happens on bare metal: a device is
configured as waiting for attestation by whatever kernel module, then the
PCI subsystem or VFIO tries to attest, accept and use it, just the same as
in a CoCo VM.
>
> > > without involving KVM or cVMs.
> >
> > It may not be feasible for all vendors.
>
> It must be. A CC guest with an in-kernel driver can definitely get the
> PCI device into RUN, so VFIO running in the guest should be able to as
> well.
You are talking about VFIO in a CoCo-VM as an attester, then definitely
yes.
>
> > I believe AMD would have one firmware call that requires the cVM handle
> > *AND* moves the device into the LOCKED state. It really depends on the
> > firmware implementation.
>
> IMHO, you would not use the secure firmware if you are not using VMs.
>
> > Yes, the secure EPT is in the secure world and managed by TDX firmware.
> > Now a SW Mirror Secure EPT is introduced in KVM and managed by KVM
> > directly, and KVM will finally use firmware calls to propagate Mirror
> > Secure EPT changes to secure EPT.
>
> If the secure world managed it then the secure world can have rules
> that work with the IOMMU as well..
Yes.
Thanks,
Yilun
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2024-06-24 21:12 ` Xu Yilun
@ 2025-01-21 17:43 ` Jason Gunthorpe
2025-01-22 4:32 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-21 17:43 UTC (permalink / raw)
To: Xu Yilun
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> subsystem) has already moved the device to RUN state. So VFIO & DPDK
> are all TEE users, no need to manipulate TDISP state between them.
> AFAICS, this is the most preferred TIO usage in CoCo-VM.
No, unfortunately. Part of the motivation to have the devices be
unlocked when the VM starts is because there is an expectation that a
driver in the VM will need to do untrusted operations to boot up the
device before it can be switched to the run state.
So any vfio use case needs to imagine that VFIO starts with an
untrusted device, does stuff to it, then pushes everything through to
run. The exact mirror of what a kernel driver should be able to do.
How exactly all this very complex stuff works, I have no idea, but
this is what I've understood is the target. :\
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-21 17:43 ` Jason Gunthorpe
@ 2025-01-22 4:32 ` Xu Yilun
2025-01-22 12:55 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-01-22 4:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Tue, Jan 21, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
>
> > When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> > subsystem) has already moved the device to RUN state. So VFIO & DPDK
> > are all TEE users, no need to manipulate TDISP state between them.
> > AFAICS, this is the most preferred TIO usage in CoCo-VM.
>
> No, unfortunately. Part of the motivation to have the devices be
> unlocked when the VM starts is because there is an expectation that a
> driver in the VM will need to do untrusted operations to boot up the
I assume these operations are device specific.
> device before it can be switched to the run state.
>
> So any vfio use case needs to imagine that VFIO starts with an
> untrusted device, does stuff to it, then pushes everything through to
I have a concern that VFIO has to do device-specific stuff. Our current
expectation is that a specific device driver deals with the untrusted
operations, then the user writes a 'bind' device sysfs node which detaches
the driver used for the untrusted phase, does the attestation and accept,
and tries to match a driver for the trusted phase (e.g. VFIO).
Thanks,
Yilun
> run. The exact mirror of what a kernel driver should be able to do.
>
> How exactly all this very complex stuff works, I have no idea, but
> this is what I've understood is the target. :\
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-22 4:32 ` Xu Yilun
@ 2025-01-22 12:55 ` Jason Gunthorpe
2025-01-23 7:41 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-22 12:55 UTC (permalink / raw)
To: Xu Yilun
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Wed, Jan 22, 2025 at 12:32:56PM +0800, Xu Yilun wrote:
> On Tue, Jan 21, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> >
> > > When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> > > subsystem) has already moved the device to RUN state. So VFIO & DPDK
> > > are all TEE users, no need to manipulate TDISP state between them.
> > > AFAICS, this is the most preferred TIO usage in CoCo-VM.
> >
> > No, unfortunately. Part of the motivation to have the devices be
> > unlocked when the VM starts is because there is an expectation that a
> > driver in the VM will need to do untrusted operations to boot up the
>
> I assume these operations are device specific.
Yes
> > device before it can be switched to the run state.
> >
> > So any vfio use case needs to imagine that VFIO starts with an
> > untrusted device, does stuff to it, then pushes everything through to
>
> I have a concern that VFIO has to do device-specific stuff. Our current
> expectation is that a specific device driver deals with the untrusted
> operations, then the user writes a 'bind' device sysfs node which detaches
> the driver used for the untrusted phase, does the attestation and accept,
> and tries to match a driver for the trusted phase (e.g. VFIO).
I don't see this as working, VFIO will FLR the device which will
destroy anything that was done prior.
VFIO itself has to do the sequence and the VFIO userspace has to
contain the device specific stuff.
The bind/unbind dance for untrusted->trusted would need to be
internalized in VFIO without unbinding. The main motivation for the
bind/unbind flow was to manage the DMA API, which VFIO does not use.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-22 12:55 ` Jason Gunthorpe
@ 2025-01-23 7:41 ` Xu Yilun
2025-01-23 13:08 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-01-23 7:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Wed, Jan 22, 2025 at 08:55:12AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 22, 2025 at 12:32:56PM +0800, Xu Yilun wrote:
> > On Tue, Jan 21, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jun 25, 2024 at 05:12:10AM +0800, Xu Yilun wrote:
> > >
> > > > When VFIO works as a TEE user in VM, it means an attester (e.g. PCI
> > > > subsystem) has already moved the device to RUN state. So VFIO & DPDK
> > > > are all TEE users, no need to manipulate TDISP state between them.
> > > > AFAICS, this is the most preferred TIO usage in CoCo-VM.
> > >
> > > No, unfortunately. Part of the motivation to have the devices be
> > > unlocked when the VM starts is because there is an expectation that a
> > > driver in the VM will need to do untrusted operations to boot up the
> >
> > I assume these operations are device specific.
>
> Yes
>
> > > device before it can be switched to the run state.
> > >
> > > So any vfio use case needs to imagine that VFIO starts with an
> > > untrusted device, does stuff to it, then pushes everything through to
> >
> > I have a concern that VFIO has to do device-specific stuff. Our current
> > expectation is that a specific device driver deals with the untrusted
> > operations, then the user writes a 'bind' device sysfs node which detaches
> > the driver used for the untrusted phase, does the attestation and accept,
> > and tries to match a driver for the trusted phase (e.g. VFIO).
>
> I don't see this as working, VFIO will FLR the device which will
> destroy anything that was done prior.
>
> VFIO itself has to do the sequence and the VFIO userspace has to
> contain the device specific stuff.
I don't have a complete idea yet. But the goal is not to make any
existing driver seamlessly work with a secure device. It is to provide a
generic way for bind/attestation/accept, and it may save drivers' effort
if they don't care about this startup process. There are plenty of
operations that a driver can't do to a secure device; FLR is one of
them. The TDISP spec has described some general rules, but some are even
device specific.
So I think a driver (including VFIO) expects changes to support a trusted
device, but may not have to cover the bind/attestation/accept flow.
Thanks,
Yilun
>
> The bind/unbind dance for untrusted->trusted would need to be
> internalized in VFIO without unbinding. The main motivation for the
> bind/unbind flow was to manage the DMA API, which VFIO does not use.
>
> Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-23 7:41 ` Xu Yilun
@ 2025-01-23 13:08 ` Jason Gunthorpe
0 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 13:08 UTC (permalink / raw)
To: Xu Yilun
Cc: Baolu Lu, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Thu, Jan 23, 2025 at 03:41:58PM +0800, Xu Yilun wrote:
> I don't have a complete idea yet. But the goal is not to make any
> existing driver seamlessly work with a secure device. It is to provide a
> generic way for bind/attestation/accept, and it may save drivers' effort
> if they don't care about this startup process. There are plenty of
> operations that a driver can't do to a secure device; FLR is one of
> them. The TDISP spec has described some general rules, but some are even
> device specific.
You can FLR a secure device, it just has to be re-secured and
re-attested after. Otherwise no VFIO for you.
> So I think a driver (including VFIO) expects changes to support a trusted
> device, but may not have to cover the bind/attestation/accept flow.
I expect changes, but not fundamental ones. VFIO will still have to
FLR devices as part of its security architecture.
The entire flow needs to have options for drivers to be involved in
the flow, somehow.
Jason
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-17 13:25 ` Jason Gunthorpe
2024-06-23 19:59 ` Xu Yilun
@ 2025-01-20 4:41 ` Baolu Lu
2025-01-20 9:45 ` Alexey Kardashevskiy
2 siblings, 0 replies; 134+ messages in thread
From: Baolu Lu @ 2025-01-20 4:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, Xu Yilun, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On 1/17/25 21:25, Jason Gunthorpe wrote:
>> If my recollection is correct, the arm
>> smmu-v3 needs it to obtain the vmid to set up the userspace event queue:
> Right now it will use a VMID unrelated to KVM. BTM support on ARM will
> require syncing the VMID with KVM.
>
> AMD and Intel may require the KVM for some reason as well.
>
> For CC I'm expecting the KVM fd to be the handle for the cVM, so any
> RPCs that want to call into the secure world need the KVM FD to get
> the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
> information and the cVM's handle.
>
> From that perspective it does make sense that any cVM related APIs,
> like "bind to cVM" would be against the VDEVICE where we have a link
> to the VIOMMU which has the KVM. On the iommufd side the VIOMMU is
> part of the object hierarchy, but does not necessarily have to force a
> vIOMMU to appear in the cVM.
Yea, from that perspective, treating the vDEVICE object as the primary
focus for the uAPIs of cVMs is more reasonable. This simplifies the
iommu drivers by eliminating the need to verify hardware capabilities
and compatibilities within each callback. Everything could be done in
one shot when allocating the vDEVICE object.
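As a rough sketch of the "one shot" idea (hypothetical names only,
nothing here exists today):
/* Validate everything once at vDEVICE creation so that later
 * per-callback paths can assume a valid, compatible setup. */
static struct iommufd_vdevice *vdevice_alloc_checked(struct iommufd_viommu *viommu,
						     struct device *dev)
{
	if (!iommu_dev_supports_trusted_dma(dev) ||
	    !viommu_compatible_with(viommu, dev))
		return ERR_PTR(-EOPNOTSUPP);
	return vdevice_alloc(viommu, dev);
}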
>
> But it also seems to me that VFIO should be able to support putting
> the device into the RUN state without involving KVM or cVMs.
Then it appears that the BIND ioctl should be part of the VFIO uAPI.
>> Intel TDX connect implementation also needs a reference to the kvm
>> pointer to obtain the secure EPT information. This is crucial because
>> the CPU's page table must be shared with the iommu.
> I thought kvm folks were NAKing this sharing entirely? Or is the
Yes, the previous idea of *generic* EPT sharing was objected to by the
KVM folks. The primary concern, as I understand it, is that KVM has many
"page non-present" tricks in EPT, which are not applicable to IOPT.
Consequently, KVM must now consider IOPT requirements when sharing the
EPT with the IOMMU, which presents a significant maintenance burden for
the KVM folks.
> secure EPT in the secure world and not directly managed by Linux?
Yes, the Secure EPT is managed by the TDX module within the secure world.
Crucially, none of KVM's "page non-present" tricks are involved there. The firmware
guarantees that any Secure EPT configuration will be applicable to
Secure IOPT. This approach may alleviate concerns raised by the KVM
community.
> AFAIK AMD is going to mirror the iommu page table like today.
>
> ARM, I suspect, will not have an "EPT" under Linux control, so
> whatever happens will be hidden in their secure world.
Intel also does not have an EPT under Linux control. KVM keeps a
mirrored page table and syncs it with the Secure EPT managed by the
firmware, through the ABIs the firmware defines, every time it is updated.
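Schematically, the mirror-and-sync model looks like this (a sketch;
write_mirror_spte() and tdx_update_secure_ept() are stand-ins for the
KVM internals and the firmware ABIs, e.g. the SEAMCALL wrappers added
later in this series):
static int mirror_set_spte(struct kvm *kvm, gfn_t gfn, u64 new_spte)
{
	/* update KVM's own, Linux-visible mirror page table entry */
	write_mirror_spte(kvm, gfn, new_spte);
	/* propagate the change to the real Secure EPT via the firmware */
	return tdx_update_secure_ept(kvm, gfn, new_spte);
}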
Thanks,
baolu
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-17 13:25 ` Jason Gunthorpe
2024-06-23 19:59 ` Xu Yilun
2025-01-20 4:41 ` Baolu Lu
@ 2025-01-20 9:45 ` Alexey Kardashevskiy
2025-01-20 13:28 ` Jason Gunthorpe
2 siblings, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-01-20 9:45 UTC (permalink / raw)
To: Jason Gunthorpe, Baolu Lu
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On 18/1/25 00:25, Jason Gunthorpe wrote:
> On Fri, Jan 17, 2025 at 09:57:40AM +0800, Baolu Lu wrote:
>> On 1/15/25 21:01, Jason Gunthorpe wrote:
>>> On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
>>>> On 15/1/25 00:35, Jason Gunthorpe wrote:
>>>>> On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
>>>>>
>>>>>>> is needed so the secure world can prepare anything it needs prior to
>>>>>>> starting the VM.
>>>>>> OK. From Dan's patchset there are some touch point for vendor tsm
>>>>>> drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
>>>>>>
>>>>>> Maybe we could move to Dan's thread for discussion.
>>>>>>
>>>>>> https://lore.kernel.org/linux-
>>>>>> coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-
>>>>>> xfh.jf.intel.com/
>>>>> I think Dan's series is different, any uapi from that series should
>>>>> not be used in the VMM case. We need proper vfio APIs for the VMM to
>>>>> use. I would expect VFIO to be calling some of that infrastructure.
>>>> Something like this experiment?
>>>>
>>>> https://github.com/aik/linux/commit/
>>>> ce052512fb8784e19745d4cb222e23cabc57792e
>>> Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
>>> hosting those APIs, the above does seem to be a reasonable direction.
>>>
>>> When the various fds are closed I would expect the kernel to unbind
>>> and restore the device back.
>>
>> I am curious about the value of tsm binding against an iomnufd_vdevice
>> instead of the physical iommufd_device.
>
> Interesting question
>
>> It is likely that the kvm pointer should be passed to iommufd during the
>> creation of a viommu object.
>
> Yes, I fully expect this
>
>> If my recollection is correct, the arm
>> smmu-v3 needs it to obtain the vmid to setup the userspace event queue:
>
> Right now it will use a VMID unrelated to KVM. BTM support on ARM will
> require syncing the VMID with KVM.
>
> AMD and Intel may require the KVM for some reason as well.
>
> For CC I'm expecting the KVM fd to be the handle for the cVM, so any
> RPCs that want to call into the secure world need the KVM FD to get
> the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
> information and the cVM's handle.
And keep KVM fd open until unbind? Or just for the short time to call
the PSP?
> From that perspective it does make sense that any cVM related APIs,
> like "bind to cVM" would be against the VDEVICE where we have a link
> to the VIOMMU which has the KVM. On the iommufd side the VIOMMU is
> part of the object hierarchy, but does not necessarily have to force a
> vIOMMU to appear in the cVM.
Well, in my sketch it "appears" as an ability to make GUEST TIO REQUEST
calls (guest <-> secure FW protocol).
> But it also seems to me that VFIO should be able to support putting
> the device into the RUN state without involving KVM or cVMs.
AMD's TDI bind handler in the PSP wants a guest handle ("GCTX") and a
guest device BDFn, and VFIO has no desire to dive into this KVM business
beyond IOMMUFD.
And then there is this GUEST TIO REQUEST, which is used for 1) enabling
the secure part of the IOMMU (so it relates to IOMMUFD) and 2) enabling
secure MMIO (which is more VFIO business).
We can do all sorts of things but the lifetime of these entangled
objects is tricky sometimes. Thanks,
>> Intel TDX connect implementation also needs a reference to the kvm
>> pointer to obtain the secure EPT information. This is crucial because
>> the CPU's page table must be shared with the iommu.
>
> I thought kvm folks were NAKing this sharing entirely? Or is the
> secure EPT in the secure world and not directly managed by Linux?
>
> AFAIK AMD is going to mirror the iommu page table like today.
>
> ARM, I suspect, will not have an "EPT" under Linux control, so
> whatever happens will be hidden in their secure world.
>
> Jason
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-20 9:45 ` Alexey Kardashevskiy
@ 2025-01-20 13:28 ` Jason Gunthorpe
2025-03-12 1:37 ` Dan Williams
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 13:28 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Baolu Lu, Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Mon, Jan 20, 2025 at 08:45:51PM +1100, Alexey Kardashevskiy wrote:
> > For CC I'm expecting the KVM fd to be the handle for the cVM, so any
> > RPCs that want to call into the secure world need the KVM FD to get
> > the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
> > information and the cVM's handle.
>
> And keep KVM fd open until unbind? Or just for the short time to call the
> PSP?
iommufd will keep the KVM fd alive so long as the vIOMMU object
exists. Other uses for kvm require it to work like this.
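Lifetime-wise, something like this (a hypothetical sketch, not current
iommufd code; only kvm_get_kvm()/kvm_put_kvm() are real):
/* The vIOMMU object pins the KVM instance for its whole lifetime, so
 * any secure-world RPC always has a valid cVM handle. */
static struct my_viommu *my_viommu_alloc(struct kvm *kvm)
{
	struct my_viommu *viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
	if (!viommu)
		return ERR_PTR(-ENOMEM);
	kvm_get_kvm(kvm);	/* paired with kvm_put_kvm() on destroy */
	viommu->kvm = kvm;
	return viommu;
}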
> > But it also seems to me that VFIO should be able to support putting
> > the device into the RUN state without involving KVM or cVMs.
>
> AMD's TDI bind handler in the PSP wants a guest handle ("GCTX") and a guest
> device BDFn, and VFIO has no desire to dive into this KVM business beyond
> IOMMUFD.
As in my other email, VFIO is not restricted to running VMs; useful
things should be available to apps like DPDK.
There is a use case for using TDISP and getting devices up into an
encrypted/attested state on pure bare metal without any KVM; VFIO
should work in that use case too.
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-01-20 13:28 ` Jason Gunthorpe
@ 2025-03-12 1:37 ` Dan Williams
2025-03-17 16:38 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Dan Williams @ 2025-03-12 1:37 UTC (permalink / raw)
To: Jason Gunthorpe, Alexey Kardashevskiy
Cc: Baolu Lu, Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
[ My ears have been burning for a couple months regarding this thread
and I have finally had the chance to circle back and read through all
the discussion on PATCH 01/12 and this PATCH 08/12, pardon the latency
while I addressed some CXL backlog ]
Jason Gunthorpe wrote:
> On Mon, Jan 20, 2025 at 08:45:51PM +1100, Alexey Kardashevskiy wrote:
>
> > > For CC I'm expecting the KVM fd to be the handle for the cVM, so any
> > > RPCs that want to call into the secure world need the KVM FD to get
> > > the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
> > > information and the cVM's handle.
> >
> > And keep KVM fd open until unbind? Or just for the short time to call the
> > PSP?
>
> iommufd will keep the KVM fd alive so long as the vIOMMU object
> exists. Other uses for kvm require it to work like this.
>
> > > But it also seems to me that VFIO should be able to support putting
> > > the device into the RUN state without involving KVM or cVMs.
> >
> > AMD's TDI bind handler in the PSP wants a guest handle ("GCTX") and a guest
> > device BDFn, and VFIO has no desire to dive into this KVM business beyond
> > IOMMUFD.
>
> As in my other email, VFIO is not restricted to running VMs, useful
> things should be available to apps like DPDK.
>
> There is a use case for using TDISP and getting devices up into an
> encrypted/attested state on pure bare metal without any KVM; VFIO
> should work in that use case too.
Are you sure you are not confusing the use case for native PCI CMA plus
PCIe IDE *without* PCIe TDISP? In other words validate device
measurements over a secure session and set up link encryption, but not
enable DMA to private memory. Without a cVM there is no private memory
for the device to talk to in the TDISP run state, but you can certainly
encrypt the PCIe link.
However that pretty much only gets you an extension of a secure session
to a PCIe link state. It does not enable end-to-end MMIO and DMA
integrity+confidentiality.
Note that to my knowledge all but the Intel TEE I/O implementation
disallow routing T=0 traffic over IDE. The host bridge only accepts T=1
traffic over IDE to private memory which is not this "without any KVM"
use case.
The uAPI proposed in the PCI/TSM series [1] is all about the setup of PCI
CMA + PCIe IDE without KVM as a precursor to all the VFIO + KVM + IOMMUFD
work needed to get the TDI able to publish private MMIO and DMA to
private memory.
[1]: http://lore.kernel.org/174107245357.1288555.10863541957822891561.stgit@dwillia2-xfh.jf.intel.com
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device
2025-03-12 1:37 ` Dan Williams
@ 2025-03-17 16:38 ` Jason Gunthorpe
0 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2025-03-17 16:38 UTC (permalink / raw)
To: Dan Williams
Cc: Alexey Kardashevskiy, Baolu Lu, Xu Yilun, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, christian.koenig,
pbonzini, seanjc, alex.williamson, vivek.kasireddy, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On Tue, Mar 11, 2025 at 06:37:13PM -0700, Dan Williams wrote:
> > There is a use case for using TDISP and getting devices up into an
> > encrypted/attested state on pure bare metal without any KVM; VFIO
> > should work in that use case too.
>
> Are you sure you are not confusing the use case for native PCI CMA plus
> PCIe IDE *without* PCIe TDISP?
Oh maybe, who knows with all this complexity :\
I see there is a crossover point: once you start getting T=1 traffic
you need a KVM handle to process it, yes, but everything prior to
that, including all the use cases with T=0 IDE, attestation and so on,
still needs to be working.
> In other words validate device measurements over a secure session
> and set up link encryption, but not enable DMA to private
> memory. Without a cVM there is no private memory for the device to
> talk to in the TDISP run state, but you can certainly encrypt the
> PCIe link.
Right. But can you do all of that without touching TDISP?
> However that pretty much only gets you an extension of a secure session
> to a PCIe link state. It does not enable end-to-end MMIO and DMA
> integrity+confidentiality.
But that is the point, right? You want to bind your IDE encryption to
the device attestation and get all of those things. I thought you
needed some TDISP for that?
> Note that to my knowledge all but the Intel TEE I/O implementation
> disallow routing T=0 traffic over IDE.
I'm not sure that will hold up long term; I hear a lot of people
talking about using IDE to solve all kinds of PCI problems that have
nothing to do with CC.
> The uAPI proposed in the PCI/TSM series [1] is all about the setup of PCI
> CMA + PCIe IDE without KVM as a precursor to all the VFIO + KVM + IOMMUFD
> work needed to get the TDI able to publish private MMIO and DMA to
> private memory.
That seems reasonable
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* [RFC PATCH 09/12] vfio/pci: Export vfio dma-buf specific info for importers
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (7 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 10/12] KVM: vfio_dmabuf: Fetch VFIO specific dma-buf data for sanity check Xu Yilun
` (3 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
VFIO dma-buf supports exporting host unaccessible MMIO regions for
private assignment. Export this info by attaching VFIO specific
dma-buf data in struct dma_buf::priv. Provide a helper
vfio_dma_buf_get_data() for importers to fetch this data.
The exported host-unaccessible info is for importers to decide if the
dma-buf is good to use. KVM only allows host-unaccessible MMIO regions
for a private MMIO slot. But it is expected that other importers (e.g.
RDMA drivers, IOMMUFD) may also use the dma-buf mechanism for P2P in
native or non-CoCo VMs, in which case host unaccessibility is not
required.
Also export the struct kvm * handle attached to the vfio device. This
allows KVM to do another sanity check: MMIO should only be assigned to
a CoCo VM if its owner device is already assigned to the same VM.
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
drivers/vfio/pci/dma_buf.c | 24 ++++++++++++++++++++++++
include/linux/vfio.h | 19 +++++++++++++++++++
2 files changed, 43 insertions(+)
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
index ad12cfb85099..ad984f2c22fc 100644
--- a/drivers/vfio/pci/dma_buf.c
+++ b/drivers/vfio/pci/dma_buf.c
@@ -9,6 +9,8 @@
MODULE_IMPORT_NS("DMA_BUF");
struct vfio_pci_dma_buf {
+ struct vfio_dma_buf_data export_data;
+
struct dma_buf *dmabuf;
struct vfio_pci_core_device *vdev;
struct list_head dmabufs_elm;
@@ -156,6 +158,14 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
priv->vdev = vdev;
priv->nr_ranges = get_dma_buf.nr_ranges;
priv->dma_ranges = dma_ranges;
+ /*
+ * KVM expects a private dma_buf. A private dma_buf must not
+ * support dma_buf_ops.map_dma_buf/mmap/vmap(). The exporter must also
+ * ensure no side channel access for the backend resource, e.g.
+ * vfio_device_ops.mmap() should not be supported.
+ */
+ priv->export_data.is_private = vdev->vdev.is_private;
+ priv->export_data.kvm = vdev->vdev.kvm;
ret = check_dma_ranges(priv, &dmabuf_size);
if (ret)
@@ -247,3 +257,17 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
}
up_write(&vdev->memory_lock);
}
+
+/*
+ * Only vfio/pci implements this, so put the helper here for now.
+ */
+struct vfio_dma_buf_data *vfio_dma_buf_get_data(struct dma_buf *dmabuf)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ if (dmabuf->ops != &vfio_pci_dmabuf_ops)
+ return ERR_PTR(-EINVAL);
+
+ return &priv->export_data;
+}
+EXPORT_SYMBOL_GPL(vfio_dma_buf_get_data);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e99d856c6cd8..fd7669e5b276 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -9,6 +9,7 @@
#define VFIO_H
+#include <linux/dma-buf.h>
#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/workqueue.h>
@@ -370,4 +371,22 @@ int vfio_virqfd_enable(void *opaque, int (*handler)(void *, void *),
void vfio_virqfd_disable(struct virqfd **pvirqfd);
void vfio_virqfd_flush_thread(struct virqfd **pvirqfd);
+/*
+ * DMA-buf - generic
+ */
+struct vfio_dma_buf_data {
+ bool is_private;
+ struct kvm *kvm;
+};
+
+#if IS_ENABLED(CONFIG_DMA_SHARED_BUFFER) && IS_ENABLED(CONFIG_VFIO_PCI_CORE)
+struct vfio_dma_buf_data *vfio_dma_buf_get_data(struct dma_buf *dmabuf);
+#else
+static inline
+struct vfio_dma_buf_data *vfio_dma_buf_get_data(struct dma_buf *dmabuf)
+{
+ return NULL;
+}
+#endif
+
#endif /* VFIO_H */
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 10/12] KVM: vfio_dmabuf: Fetch VFIO specific dma-buf data for sanity check
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (8 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 09/12] vfio/pci: Export vfio dma-buf specific info for importers Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 11/12] KVM: x86/mmu: Export kvm_is_mmio_pfn() Xu Yilun
` (2 subsequent siblings)
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Fetch VFIO specific dma-buf data to see if the dma-buf is eligible to
be assigned to a CoCo VM as private MMIO.
KVM expects a host-unaccessible dma-buf for private MMIO mapping, so
the exporter needs to provide this information. VFIO also provides the
struct kvm *kvm handle for KVM to check if the owner device of the
MMIO region is already assigned to the same CoCo VM.
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
virt/kvm/vfio_dmabuf.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/virt/kvm/vfio_dmabuf.c b/virt/kvm/vfio_dmabuf.c
index c427ab39c68a..26e01b815ebf 100644
--- a/virt/kvm/vfio_dmabuf.c
+++ b/virt/kvm/vfio_dmabuf.c
@@ -12,6 +12,22 @@ struct kvm_vfio_dmabuf {
struct kvm_memory_slot *slot;
};
+static struct vfio_dma_buf_data *kvm_vfio_dma_buf_get_data(struct dma_buf *dmabuf)
+{
+ struct vfio_dma_buf_data *(*fn)(struct dma_buf *dmabuf);
+ struct vfio_dma_buf_data *ret;
+
+ fn = symbol_get(vfio_dma_buf_get_data);
+ if (!fn)
+ return ERR_PTR(-ENOENT);
+
+ ret = fn(dmabuf);
+
+ symbol_put(vfio_dma_buf_get_data);
+
+ return ret;
+}
+
static void kv_dmabuf_move_notify(struct dma_buf_attachment *attach)
{
struct kvm_vfio_dmabuf *kv_dmabuf = attach->importer_priv;
@@ -48,6 +64,7 @@ int kvm_vfio_dmabuf_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
size_t size = slot->npages << PAGE_SHIFT;
struct dma_buf_attachment *attach;
struct kvm_vfio_dmabuf *kv_dmabuf;
+ struct vfio_dma_buf_data *data;
struct dma_buf *dmabuf;
int ret;
@@ -60,6 +77,17 @@
goto err_dmabuf;
}
+ data = kvm_vfio_dma_buf_get_data(dmabuf);
+ if (IS_ERR(data)) {
+ ret = PTR_ERR(data);
+ goto err_dmabuf;
+ }
+
+ if (!data->is_private || data->kvm != kvm) {
+ ret = -EINVAL;
+ goto err_dmabuf;
+ }
+
kv_dmabuf = kzalloc(sizeof(*kv_dmabuf), GFP_KERNEL);
if (!kv_dmabuf) {
ret = -ENOMEM;
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 11/12] KVM: x86/mmu: Export kvm_is_mmio_pfn()
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (9 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 10/12] KVM: vfio_dmabuf: Fetch VFIO specific dma-buf data for sanity check Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-01-07 14:27 ` [RFC PATCH 12/12] KVM: TDX: Implement TDX specific private MMIO map/unmap for SEPT Xu Yilun
2025-04-29 6:48 ` [RFC PATCH 00/12] Private MMIO support for private assigned dev Alexey Kardashevskiy
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
From: Xu Yilun <yilun.xu@intel.com>
Export kvm_is_mmio_pfn() for KVM TDX to decide which SEAMCALL should be
used to set up an SEPT leaf entry.
The TDX Module requires tdh_mem_page_aug() for memory page setup,
and tdh_mmio_map() for MMIO setup.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/spte.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e40097c7e8d4..23ff0e6c9ef6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -102,6 +102,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes);
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn);
static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index e819d16655b6..0a9a81afba93 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -105,7 +105,7 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
return spte;
}
-static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
{
if (pfn_valid(pfn))
return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
@@ -125,6 +125,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
pfn_to_hpa(pfn + 1) - 1,
E820_TYPE_RAM);
}
+EXPORT_SYMBOL_GPL(kvm_is_mmio_pfn);
/*
* Returns true if the SPTE has bits that may be set without holding mmu_lock.
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* [RFC PATCH 12/12] KVM: TDX: Implement TDX specific private MMIO map/unmap for SEPT
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (10 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 11/12] KVM: x86/mmu: Export kvm_is_mmio_pfn() Xu Yilun
@ 2025-01-07 14:27 ` Xu Yilun
2025-04-29 6:48 ` [RFC PATCH 00/12] Private MMIO support for private assigned dev Alexey Kardashevskiy
12 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-01-07 14:27 UTC (permalink / raw)
To: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, aik
Cc: yilun.xu, yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Implement TDX specific private MMIO map/unmap in the existing TDP MMU hooks.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
---
TODO: This patch is still based on the earlier kvm-coco-queue version
(v6.13-rc2). Will follow up on the latest SEAMCALL wrapper change. [1]
[1] https://lore.kernel.org/all/20250101074959.412696-1-pbonzini@redhat.com/
---
arch/x86/include/asm/tdx.h | 3 ++
arch/x86/kvm/vmx/tdx.c | 57 +++++++++++++++++++++++++++++++++++--
arch/x86/virt/vmx/tdx/tdx.c | 52 +++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 3 ++
4 files changed, 113 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 01409a59224d..7d158bbf79f4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -151,6 +151,9 @@ u64 tdh_mem_page_remove(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(u64 tdr);
u64 tdh_phymem_page_wbinvd_hkid(u64 hpa, u64 hkid);
+u64 tdh_mmio_map(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx);
+u64 tdh_mmio_block(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
+u64 tdh_mmio_unmap(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 69ef9c967fbf..9b43a2ee2203 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1576,6 +1576,29 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
return 0;
}
+static int tdx_mmio_map(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ hpa_t hpa = pfn_to_hpa(pfn);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ u64 entry, level_state;
+ u64 err;
+
+ err = tdh_mmio_map(kvm_tdx->tdr_pa, gpa, tdx_level, hpa,
+ &entry, &level_state);
+ if (unlikely(err & TDX_OPERAND_BUSY))
+ return -EBUSY;
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MMIO_MAP, err, entry, level_state);
+ return -EIO;
+ }
+
+ return 0;
+}
+
/*
* KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to get guest pages and
* tdx_gmem_post_populate() to premap page table pages into private EPT.
@@ -1610,6 +1633,9 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
return -EINVAL;
+ if (kvm_is_mmio_pfn(pfn))
+ return tdx_mmio_map(kvm, gfn, level, pfn);
+
/*
* Because guest_memfd doesn't support page migration with
* a_ops->migrate_folio (yet), no callback is triggered for KVM on page
@@ -1647,6 +1673,20 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
return -EINVAL;
+ if (kvm_is_mmio_pfn(pfn)) {
+ do {
+ err = tdh_mmio_unmap(kvm_tdx->tdr_pa, gpa, tdx_level,
+ &entry, &level_state);
+ } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MMIO_UNMAP, err, entry, level_state);
+ return -EIO;
+ }
+
+ return 0;
+ }
+
do {
/*
* When zapping private page, write lock is held. So no race
@@ -1715,7 +1755,7 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
}
static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level)
+ enum pg_level level, kvm_pfn_t pfn)
{
int tdx_level = pg_level_to_tdx_sept_level(level);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1725,6 +1765,19 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
/* For now large page isn't supported yet. */
WARN_ON_ONCE(level != PG_LEVEL_4K);
+ if (kvm_is_mmio_pfn(pfn)) {
+ err = tdh_mmio_block(kvm_tdx->tdr_pa, gpa, tdx_level,
+ &entry, &level_state);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MMIO_BLOCK, err, entry, level_state);
+ return -EIO;
+ }
+
+ return 0;
+ }
+
err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &entry, &level_state);
if (unlikely(err == TDX_ERROR_SEPT_BUSY))
return -EAGAIN;
@@ -1816,7 +1869,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
return -EINVAL;
- ret = tdx_sept_zap_private_spte(kvm, gfn, level);
+ ret = tdx_sept_zap_private_spte(kvm, gfn, level, pfn);
if (ret)
return ret;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 57195cf0d832..3b2109877a39 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1951,3 +1951,55 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hpa, u64 hkid)
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+u64 tdh_mmio_map(u64 tdr, u64 gpa, u64 level, u64 hpa, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ .r8 = hpa,
+ };
+ u64 ret;
+
+ ret = tdx_seamcall_sept(TDH_MMIO_MAP, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mmio_map);
+
+u64 tdh_mmio_block(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+ u64 ret;
+
+ ret = tdx_seamcall_sept(TDH_MMIO_BLOCK, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mmio_block);
+
+u64 tdh_mmio_unmap(u64 tdr, u64 gpa, u64 level, u64 *rcx, u64 *rdx)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+ u64 ret;
+
+ ret = tdx_seamcall_sept(TDH_MMIO_UNMAP, &args);
+
+ *rcx = args.rcx;
+ *rdx = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mmio_unmap);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 58d5754dcb4d..a83a90a043a5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -49,6 +49,9 @@
#define TDH_VP_WR 43
#define TDH_PHYMEM_PAGE_WBINVD 41
#define TDH_SYS_CONFIG 45
+#define TDH_MMIO_MAP 158
+#define TDH_MMIO_BLOCK 159
+#define TDH_MMIO_UNMAP 160
/*
* SEAMCALL leaf:
--
2.25.1
^ permalink raw reply related [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-01-07 14:27 [RFC PATCH 00/12] Private MMIO support for private assigned dev Xu Yilun
` (11 preceding siblings ...)
2025-01-07 14:27 ` [RFC PATCH 12/12] KVM: TDX: Implement TDX specific private MMIO map/unmap for SEPT Xu Yilun
@ 2025-04-29 6:48 ` Alexey Kardashevskiy
2025-04-29 7:50 ` Alexey Kardashevskiy
12 siblings, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-04-29 6:48 UTC (permalink / raw)
To: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
jgg, vivek.kasireddy, dan.j.williams
Cc: yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
On 8/1/25 01:27, Xu Yilun wrote:
> This series is based on an earlier kvm-coco-queue version (v6.12-rc2)
Has this been pushed somewhere public? The patchset does not apply on top of v6.12-rc2, for example (I fixed locally).
Also, is there somewhere a QEMU tree using this? I am trying to use this new DMA_BUF feature and this requires quite some not-so-obvious plumbing. Thanks,
> which includes all basic TDX patches.
>
> The series is to start the early stage discussion of the private MMIO
> handling for Coco-VM, which is part of the Private Device
> Assignment (aka TEE-IO, TIO) enabling. There is already some
> discussion about the context of TIO:
>
> https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
> https://lore.kernel.org/all/20240823132137.336874-1-aik@amd.com/
>
> Private MMIOs are resources owned by private assigned devices. Like
> private memory, they are also not intended to be accessed by the host;
> they are only accessible by a CoCo-VM via some secondary MMU (e.g.
> Secure EPT). This series is for KVM to map these MMIO resources without
> first mapping them into the host. For this purpose, this series uses FD
> based MMIO resources for secure mapping, and dma-buf is chosen as the
> FD based backend, just like guest_memfd for private memory. Patch 6 in
> this series has a more detailed description.
>
>
> Patch 1 changes the dma-buf core, exposing a new kAPI for importers to
> get a dma-buf's PFN without DMA mapping. KVM could use this kAPI to
> build the GPA -> HPA mapping in the KVM MMU.
>
> Patches 2-4 are from Jason & Vivek and allow vfio-pci to export MMIO
> resources as dma-buf. The original series is for native P2P DMA and
> focuses on P2P DMA mapping open issues. I removed the P2P DMA mapping
> code just to focus the early stage discussion on private MMIO. The
> original series:
>
> https://lore.kernel.org/all/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> https://lore.kernel.org/kvm/20240624065552.1572580-1-vivek.kasireddy@intel.com/
>
> Patch 5 is the implementation of the get_pfn() callback for the vfio
> dma-buf exporter.
>
> Patches 6-7 are about KVM supporting the private MMIO memory slot
> backed by vfio dma-buf.
>
> Patches 8-10 are about how KVM verifies that the user provided dma-buf
> fd is eligible for a private MMIO slot.
>
> Patches 11-12 are an example of how KVM TDX sets up the Secure EPT for
> private MMIO.
>
>
> TODOs:
>
> - Follow up on the evolution of the original VFIO dma-buf series.
> - Follow up on the evolution of the basic TDX patches.
>
>
> Vivek Kasireddy (3):
> vfio: Export vfio device get and put registration helpers
> vfio/pci: Share the core device pointer while invoking feature
> functions
> vfio/pci: Allow MMIO regions to be exported through dma-buf
>
> Xu Yilun (9):
> dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
> vfio/pci: Support get_pfn() callback for dma-buf
> KVM: Support vfio_dmabuf backed MMIO region
> KVM: x86/mmu: Handle page fault for vfio_dmabuf backed MMIO
> vfio/pci: Create host unaccessible dma-buf for private device
> vfio/pci: Export vfio dma-buf specific info for importers
> KVM: vfio_dmabuf: Fetch VFIO specific dma-buf data for sanity check
> KVM: x86/mmu: Export kvm_is_mmio_pfn()
> KVM: TDX: Implement TDX specific private MMIO map/unmap for SEPT
>
> Documentation/virt/kvm/api.rst | 7 +
> arch/x86/include/asm/tdx.h | 3 +
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 25 ++-
> arch/x86/kvm/mmu/spte.c | 3 +-
> arch/x86/kvm/vmx/tdx.c | 57 +++++-
> arch/x86/virt/vmx/tdx/tdx.c | 52 ++++++
> arch/x86/virt/vmx/tdx/tdx.h | 3 +
> drivers/dma-buf/dma-buf.c | 90 ++++++++--
> drivers/vfio/device_cdev.c | 9 +-
> drivers/vfio/pci/Makefile | 1 +
> drivers/vfio/pci/dma_buf.c | 273 +++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_config.c | 22 ++-
> drivers/vfio/pci/vfio_pci_core.c | 64 +++++--
> drivers/vfio/pci/vfio_pci_priv.h | 27 +++
> drivers/vfio/pci/vfio_pci_rdwr.c | 3 +
> drivers/vfio/vfio_main.c | 2 +
> include/linux/dma-buf.h | 13 ++
> include/linux/kvm_host.h | 25 ++-
> include/linux/vfio.h | 22 +++
> include/linux/vfio_pci_core.h | 1 +
> include/uapi/linux/kvm.h | 1 +
> include/uapi/linux/vfio.h | 34 +++-
> virt/kvm/Kconfig | 6 +
> virt/kvm/Makefile.kvm | 1 +
> virt/kvm/kvm_main.c | 32 +++-
> virt/kvm/kvm_mm.h | 19 ++
> virt/kvm/vfio_dmabuf.c | 151 ++++++++++++++++
> 28 files changed, 896 insertions(+), 51 deletions(-)
> create mode 100644 drivers/vfio/pci/dma_buf.c
> create mode 100644 virt/kvm/vfio_dmabuf.c
>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-04-29 6:48 ` [RFC PATCH 00/12] Private MMIO support for private assigned dev Alexey Kardashevskiy
@ 2025-04-29 7:50 ` Alexey Kardashevskiy
2025-05-09 3:04 ` Alexey Kardashevskiy
0 siblings, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-04-29 7:50 UTC (permalink / raw)
To: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
jgg, vivek.kasireddy, dan.j.williams
Cc: yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
On 29/4/25 16:48, Alexey Kardashevskiy wrote:
> On 8/1/25 01:27, Xu Yilun wrote:
>> This series is based on an earlier kvm-coco-queue version (v6.12-rc2)
>
> Has this been pushed somewhere public? The patchset does not apply on top of v6.12-rc2, for example (I fixed locally).
> Also, is there somewhere a QEMU tree using this? I am trying to use this new DMA_BUF feature and this requires quite some not-so-obvious plumbing. Thanks,
More to the point, to make it work, QEMU needs to register the VFIO MMIO BAR with KVM_SET_USER_MEMORY_REGION2, which passes slot->guest_memfd to KVM; that fd essentially comes from VFIORegion->mmaps[0].mem->ram_block->guest_memfd. But since you disabled mmap for private MMIO, there is no MR which QEMU would even try registering as a KVM memslot, and there are many ways to fix that. I took a shortcut and re-enabled mmap(), but I wonder what exactly you did. Makes sense? Thanks,
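For reference, the memslot registration I mean looks roughly like this
(userspace sketch; per this RFC the guest_memfd field would carry the
VFIO dmabuf fd, the exact flag may differ, and error handling is
omitted):
#include <linux/kvm.h>
#include <sys/ioctl.h>
static int register_private_mmio(int vm_fd, __u32 slot, __u64 gpa,
				 __u64 size, int dmabuf_fd)
{
	struct kvm_userspace_memory_region2 region = {
		.slot		 = slot,
		.flags		 = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = gpa,
		.memory_size	 = size,
		.guest_memfd	 = dmabuf_fd,	/* dmabuf fd in place of guest_memfd */
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}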
>
>> which includes all basic TDX patches.
>>
>> The series is to start the early stage discussion of the private MMIO
>> handling for Coco-VM, which is part of the Private Device
>> Assignment (aka TEE-IO, TIO) enabling. There is already some
>> discussion about the context of TIO:
>>
>> https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
>> https://lore.kernel.org/all/20240823132137.336874-1-aik@amd.com/
>>
>> Private MMIOs are resources owned by private assigned devices. Like
>> private memory, they are also not intended to be accessed by the host;
>> they are only accessible by a CoCo-VM via some secondary MMU (e.g.
>> Secure EPT). This series is for KVM to map these MMIO resources without
>> first mapping them into the host. For this purpose, this series uses FD
>> based MMIO resources for secure mapping, and dma-buf is chosen as the
>> FD based backend, just like guest_memfd for private memory. Patch 6 in
>> this series has a more detailed description.
>>
>>
>> Patch 1 changes the dma-buf core, exposing a new kAPI for importers to
>> get a dma-buf's PFN without DMA mapping. KVM could use this kAPI to
>> build the GPA -> HPA mapping in the KVM MMU.
>>
>> Patches 2-4 are from Jason & Vivek and allow vfio-pci to export MMIO
>> resources as dma-buf. The original series is for native P2P DMA and
>> focuses on P2P DMA mapping open issues. I removed the P2P DMA mapping
>> code just to focus the early stage discussion on private MMIO. The
>> original series:
>>
>> https://lore.kernel.org/all/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
>> https://lore.kernel.org/kvm/20240624065552.1572580-1-vivek.kasireddy@intel.com/
>>
>> Patch 5 is the implementation of the get_pfn() callback for the vfio
>> dma-buf exporter.
>>
>> Patches 6-7 are about KVM supporting the private MMIO memory slot
>> backed by vfio dma-buf.
>>
>> Patches 8-10 are about how KVM verifies that the user provided dma-buf
>> fd is eligible for a private MMIO slot.
>>
>> Patches 11-12 are an example of how KVM TDX sets up the Secure EPT for
>> private MMIO.
>>
>>
>> TODOs:
>>
>> - Follow up on the evolution of the original VFIO dma-buf series.
>> - Follow up on the evolution of the basic TDX patches.
>>
>>
>> Vivek Kasireddy (3):
>> vfio: Export vfio device get and put registration helpers
>> vfio/pci: Share the core device pointer while invoking feature
>> functions
>> vfio/pci: Allow MMIO regions to be exported through dma-buf
>>
>> Xu Yilun (9):
>> dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
>> vfio/pci: Support get_pfn() callback for dma-buf
>> KVM: Support vfio_dmabuf backed MMIO region
>> KVM: x86/mmu: Handle page fault for vfio_dmabuf backed MMIO
>> vfio/pci: Create host unaccessible dma-buf for private device
>> vfio/pci: Export vfio dma-buf specific info for importers
>> KVM: vfio_dmabuf: Fetch VFIO specific dma-buf data for sanity check
>> KVM: x86/mmu: Export kvm_is_mmio_pfn()
>> KVM: TDX: Implement TDX specific private MMIO map/unmap for SEPT
>>
>> Documentation/virt/kvm/api.rst | 7 +
>> arch/x86/include/asm/tdx.h | 3 +
>> arch/x86/kvm/mmu.h | 1 +
>> arch/x86/kvm/mmu/mmu.c | 25 ++-
>> arch/x86/kvm/mmu/spte.c | 3 +-
>> arch/x86/kvm/vmx/tdx.c | 57 +++++-
>> arch/x86/virt/vmx/tdx/tdx.c | 52 ++++++
>> arch/x86/virt/vmx/tdx/tdx.h | 3 +
>> drivers/dma-buf/dma-buf.c | 90 ++++++++--
>> drivers/vfio/device_cdev.c | 9 +-
>> drivers/vfio/pci/Makefile | 1 +
>> drivers/vfio/pci/dma_buf.c | 273 +++++++++++++++++++++++++++++
>> drivers/vfio/pci/vfio_pci_config.c | 22 ++-
>> drivers/vfio/pci/vfio_pci_core.c | 64 +++++--
>> drivers/vfio/pci/vfio_pci_priv.h | 27 +++
>> drivers/vfio/pci/vfio_pci_rdwr.c | 3 +
>> drivers/vfio/vfio_main.c | 2 +
>> include/linux/dma-buf.h | 13 ++
>> include/linux/kvm_host.h | 25 ++-
>> include/linux/vfio.h | 22 +++
>> include/linux/vfio_pci_core.h | 1 +
>> include/uapi/linux/kvm.h | 1 +
>> include/uapi/linux/vfio.h | 34 +++-
>> virt/kvm/Kconfig | 6 +
>> virt/kvm/Makefile.kvm | 1 +
>> virt/kvm/kvm_main.c | 32 +++-
>> virt/kvm/kvm_mm.h | 19 ++
>> virt/kvm/vfio_dmabuf.c | 151 ++++++++++++++++
>> 28 files changed, 896 insertions(+), 51 deletions(-)
>> create mode 100644 drivers/vfio/pci/dma_buf.c
>> create mode 100644 virt/kvm/vfio_dmabuf.c
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-04-29 7:50 ` Alexey Kardashevskiy
@ 2025-05-09 3:04 ` Alexey Kardashevskiy
2025-05-09 11:12 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-09 3:04 UTC (permalink / raw)
To: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
jgg, vivek.kasireddy, dan.j.williams
Cc: yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
Ping?
Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
On 29/4/25 17:50, Alexey Kardashevskiy wrote:
>
>
> On 29/4/25 16:48, Alexey Kardashevskiy wrote:
>> On 8/1/25 01:27, Xu Yilun wrote:
>>> This series is based on an earlier kvm-coco-queue version (v6.12-rc2)
>>
>> Has this been pushed somewhere public? The patchset does not apply on top of v6.12-rc2, for example (I fixed locally).
> > Also, is there somewhere a QEMU tree using this? I am trying to use this new DMA_BUF feature and this requires quite some not-so-obvious plumbing. Thanks,
>
>
> More to the point, to make it work, QEMU needs to register the VFIO MMIO BAR with KVM_SET_USER_MEMORY_REGION2, which passes slot->guest_memfd to KVM; that fd essentially comes from VFIORegion->mmaps[0].mem->ram_block->guest_memfd. But since you disabled mmap for private MMIO, there is no MR which QEMU would even try registering as a KVM memslot, and there are many ways to fix that. I took a shortcut and re-enabled mmap(), but I wonder what exactly you did. Makes sense? Thanks,
>
>
>>
>>> which includes all basic TDX patches.
>>>
>>> The series is to start the early stage discussion of the private MMIO
>>> handling for Coco-VM, which is part of the Private Device
>>> Assignment (aka TEE-IO, TIO) enabling. There is already some
>>> discussion about the context of TIO:
>>>
>>> https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
>>> https://lore.kernel.org/all/20240823132137.336874-1-aik@amd.com/
>>>
>>> Private MMIOs are resources owned by private assigned devices. Like
>>> private memory, they are also not intended to be accessed by the host;
>>> they are only accessible by a CoCo-VM via some secondary MMU (e.g.
>>> Secure EPT). This series is for KVM to map these MMIO resources without
>>> first mapping them into the host. For this purpose, this series uses FD
>>> based MMIO resources for secure mapping, and dma-buf is chosen as the
>>> FD based backend, just like guest_memfd for private memory. Patch 6 in
>>> this series has a more detailed description.
>>>
>>>
>>> Patch 1 changes the dma-buf core, exposing a new kAPI for importers to
>>> get a dma-buf's PFN without DMA mapping. KVM could use this kAPI to
>>> build the GPA -> HPA mapping in the KVM MMU.
>>>
>>> Patches 2-4 are from Jason & Vivek and allow vfio-pci to export MMIO
>>> resources as dma-buf. The original series is for native P2P DMA and
>>> focuses on P2P DMA mapping open issues. I removed the P2P DMA mapping
>>> code just to focus the early stage discussion on private MMIO. The
>>> original series:
>>>
>>> https://lore.kernel.org/all/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
>>> https://lore.kernel.org/kvm/20240624065552.1572580-1-vivek.kasireddy@intel.com/
>>>
>>> Patch 5 is the implementation of the get_pfn() callback for the vfio
>>> dma-buf exporter.
>>>
>>> Patches 6-7 are about KVM supporting the private MMIO memory slot
>>> backed by vfio dma-buf.
>>>
>>> Patches 8-10 are about how KVM verifies that the user provided dma-buf
>>> fd is eligible for a private MMIO slot.
>>>
>>> Patches 11-12 are an example of how KVM TDX sets up the Secure EPT for
>>> private MMIO.
>>>
>>>
>>> TODOs:
>>>
>>> - Follow up on the evolution of the original VFIO dma-buf series.
>>> - Follow up on the evolution of the basic TDX patches.
>>>
>>>
>>> Vivek Kasireddy (3):
>>> vfio: Export vfio device get and put registration helpers
>>> vfio/pci: Share the core device pointer while invoking feature
>>> functions
>>> vfio/pci: Allow MMIO regions to be exported through dma-buf
>>>
>>> Xu Yilun (9):
>>> dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
>>> vfio/pci: Support get_pfn() callback for dma-buf
>>> KVM: Support vfio_dmabuf backed MMIO region
>>> KVM: x86/mmu: Handle page fault for vfio_dmabuf backed MMIO
>>> vfio/pci: Create host unaccessible dma-buf for private device
>>> vfio/pci: Export vfio dma-buf specific info for importers
>>> KVM: vfio_dmabuf: Fetch VFIO specific dma-buf data for sanity check
>>> KVM: x86/mmu: Export kvm_is_mmio_pfn()
>>> KVM: TDX: Implement TDX specific private MMIO map/unmap for SEPT
>>>
>>> Documentation/virt/kvm/api.rst | 7 +
>>> arch/x86/include/asm/tdx.h | 3 +
>>> arch/x86/kvm/mmu.h | 1 +
>>> arch/x86/kvm/mmu/mmu.c | 25 ++-
>>> arch/x86/kvm/mmu/spte.c | 3 +-
>>> arch/x86/kvm/vmx/tdx.c | 57 +++++-
>>> arch/x86/virt/vmx/tdx/tdx.c | 52 ++++++
>>> arch/x86/virt/vmx/tdx/tdx.h | 3 +
>>> drivers/dma-buf/dma-buf.c | 90 ++++++++--
>>> drivers/vfio/device_cdev.c | 9 +-
>>> drivers/vfio/pci/Makefile | 1 +
>>> drivers/vfio/pci/dma_buf.c | 273 +++++++++++++++++++++++++++++
>>> drivers/vfio/pci/vfio_pci_config.c | 22 ++-
>>> drivers/vfio/pci/vfio_pci_core.c | 64 +++++--
>>> drivers/vfio/pci/vfio_pci_priv.h | 27 +++
>>> drivers/vfio/pci/vfio_pci_rdwr.c | 3 +
>>> drivers/vfio/vfio_main.c | 2 +
>>> include/linux/dma-buf.h | 13 ++
>>> include/linux/kvm_host.h | 25 ++-
>>> include/linux/vfio.h | 22 +++
>>> include/linux/vfio_pci_core.h | 1 +
>>> include/uapi/linux/kvm.h | 1 +
>>> include/uapi/linux/vfio.h | 34 +++-
>>> virt/kvm/Kconfig | 6 +
>>> virt/kvm/Makefile.kvm | 1 +
>>> virt/kvm/kvm_main.c | 32 +++-
>>> virt/kvm/kvm_mm.h | 19 ++
>>> virt/kvm/vfio_dmabuf.c | 151 ++++++++++++++++
>>> 28 files changed, 896 insertions(+), 51 deletions(-)
>>> create mode 100644 drivers/vfio/pci/dma_buf.c
>>> create mode 100644 virt/kvm/vfio_dmabuf.c
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-09 3:04 ` Alexey Kardashevskiy
@ 2025-05-09 11:12 ` Xu Yilun
2025-05-09 16:28 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-09 11:12 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
> Ping?
Sorry for the late reply, I was on vacation.
> Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
As discussed in the thread, this kAPI is not well considered, but IIUC
the concept of "importer mapping" is still valid. We need more
investigation into all the needs - P2P, CC memory, private bus
channels - and to work out a formal API.
However, in the last few months I have been focusing on the high level
TIO flow - TSM framework, IOMMUFD based bind/unbind - so there is not
much progress here and it still uses this temporary kAPI. But as long
as "importer mapping" is alive, the dmabuf fd for KVM is still valid
and we can enable TIO based on that.
>
>
> On 29/4/25 17:50, Alexey Kardashevskiy wrote:
> >
> >
> > On 29/4/25 16:48, Alexey Kardashevskiy wrote:
> > > On 8/1/25 01:27, Xu Yilun wrote:
> > > > This series is based on an earlier kvm-coco-queue version (v6.12-rc2)
> > >
> > > Has this been pushed somewhere public? The patchset does not apply on top of v6.12-rc2, for example (I fixed locally).
Sorry, not yet. I'm trying to solve this ... same for the QEMU tree.
> > > Also, is there somewhere a QEMU tree using this? I am trying to use this new DMA_BUF feature and this requires quite some not-so-obvious plumbing. Thanks,
> >
> >
> > More to the point, to make it work, QEMU needs to register the VFIO MMIO BAR with KVM_SET_USER_MEMORY_REGION2, which passes slot->guest_memfd to KVM; that fd essentially comes from VFIORegion->mmaps[0].mem->ram_block->guest_memfd. But since you disabled mmap for private MMIO, there is no MR which QEMU would even try registering as a KVM memslot, and there are many ways to fix that. I took a shortcut and re-enabled mmap(), but I wonder what exactly you did. Makes sense? Thanks,
Yes, QEMU needs changes. 08/12 "vfio/pci: Create host unaccessible dma-buf for private device"
adds a new flag VFIO_REGION_INFO_FLAG_PRIVATE to indicate the user can
create a dmabuf on this region.
I'm also not very serious about the QEMU changes for now, just FYI:
I use the VFIO_REGION_INFO_FLAG_PRIVATE flag to revive region->mmaps.
int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
...
+ if (region->flags & VFIO_REGION_INFO_FLAG_PRIVATE) {
+ region->nr_mmaps = 1;
+ region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+ region->mmaps[0].offset = 0;
+ region->mmaps[0].size = region->size;
+ region->mmaps[0].dmabuf_fd = -1;
}
Then in vfio_region_mmap(), use a new memory_region_init_dmabuf() to populate
the MR.
int vfio_region_mmap(VFIORegion *region)
+ if (use_dmabuf) {
+ /* create vfio dmabuf fd */
+ ret = vfio_create_dmabuf(region->vbasedev, region->nr,
+ region->mmaps[i].offset,
+ region->mmaps[i].size);
+ if (ret < 0) {
+ goto sub_unmap;
+ }
+
+ region->mmaps[i].dmabuf_fd = ret;
+
+ name = g_strdup_printf("%s dmabuf[%d]",
+ memory_region_name(region->mem), i);
+ memory_region_init_dmabuf(&region->mmaps[i].mem,
+ memory_region_owner(region->mem),
+ name, region->mmaps[i].size,
+ region->mmaps[i].dmabuf_fd);
+ g_free(name);
+ } else {
Thanks,
Yilun
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-09 11:12 ` Xu Yilun
@ 2025-05-09 16:28 ` Xu Yilun
2025-05-09 18:43 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-09 16:28 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson, jgg,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
> On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
> > Ping?
>
> Sorry for the late reply, I was on vacation.
>
> > Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
>
> As discussed in the thread, this kAPI is not well considered, but IIUC
> the concept of "importer mapping" is still valid. We need more
> investigation into all the needs - P2P, CC memory, private bus
> channels - and to work out a formal API.
>
> However, in the last few months I have been focusing on the high level
> TIO flow - TSM framework, IOMMUFD based bind/unbind - so there is not
> much progress here and it still uses this temporary kAPI. But as long
> as "importer mapping" is alive, the dmabuf fd for KVM is still valid
> and we can enable TIO based on that.
Oh, I forgot to mention that I moved the dmabuf creation from VFIO to
IOMMUFD recently; the IOCTL is against iommufd_device. According to
Jason's opinion [1], TSM bind/unbind should be called against
iommufd_device, so I need to do the same for dmabuf. This is because
Intel TDX Connect enforces a specific operation sequence between TSM
unbind & MMIO unmap:
1. STOP TDI via TDISP message STOP_INTERFACE
2. Private MMIO unmap from Secure EPT
3. Trusted Device Context Table cleanup for the TDI
4. TDI ownership reclaim and metadata free
That makes TSM unbind & dmabuf closely correlated, so they should be
managed by the same kernel component; the required ordering is sketched
below.
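A minimal sketch of that ordering (all names are placeholders, not real
kAPI):
static void tdx_connect_tsm_unbind(struct tdi *tdi)
{
	tdisp_send_stop_interface(tdi);	/* 1. STOP the TDI */
	sept_unmap_private_mmio(tdi);	/* 2. Secure EPT unmap */
	trusted_dev_ctx_cleanup(tdi);	/* 3. trusted DevCtx table */
	tdi_reclaim_and_free(tdi);	/* 4. reclaim ownership, free metadata */
}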
IIUC, the suggested flow is that VFIO receives a CC capable flag and
propagates it to IOMMUFD, which means VFIO hands over the device's MMIO
management & CC management to IOMMUFD.
[1]: https://lore.kernel.org/all/20250306182614.GF354403@ziepe.ca/
Thanks,
Yilun
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-09 16:28 ` Xu Yilun
@ 2025-05-09 18:43 ` Jason Gunthorpe
2025-05-10 3:47 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-09 18:43 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Sat, May 10, 2025 at 12:28:48AM +0800, Xu Yilun wrote:
> On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
> > On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
> > > Ping?
> >
> > Sorry for the late reply, I was on vacation.
> >
> > > Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
> >
> > As discussed in the thread, this kAPI is not well considered, but IIUC
> > the concept of "importer mapping" is still valid. We need more
> > investigation into all the needs - P2P, CC memory, private bus
> > channels - and to work out a formal API.
> >
> > However, in the last few months I have been focusing on the high level
> > TIO flow - TSM framework, IOMMUFD based bind/unbind - so there is not
> > much progress here and it still uses this temporary kAPI. But as long
> > as "importer mapping" is alive, the dmabuf fd for KVM is still valid
> > and we can enable TIO based on that.
>
> > Oh, I forgot to mention that I moved the dmabuf creation from VFIO to
> > IOMMUFD recently; the IOCTL is against iommufd_device.
I'm surprised by this... iommufd shouldn't be doing PCI stuff; it is
just about managing the translation control of the device.
> According to Jason's
> opinion [1], TSM bind/unbind should be called against iommufd_device,
> > so I need to do the same for dmabuf. This is because Intel TDX
> Connect enforces a specific operation sequence between TSM unbind & MMIO
> unmap:
>
> 1. STOP TDI via TDISP message STOP_INTERFACE
> 2. Private MMIO unmap from Secure EPT
> 3. Trusted Device Context Table cleanup for the TDI
> 4. TDI ownership reclaim and metadata free
So your issue is you need to shoot down the dmabuf during vPCI device
destruction?
VFIO also needs to shoot down the MMIO during things like FLR
I don't think moving to iommufd really fixes it; it sounds like you
need more coordination between the two parts??
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-09 18:43 ` Jason Gunthorpe
@ 2025-05-10 3:47 ` Xu Yilun
2025-05-12 9:30 ` Alexey Kardashevskiy
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-10 3:47 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, May 09, 2025 at 03:43:18PM -0300, Jason Gunthorpe wrote:
> On Sat, May 10, 2025 at 12:28:48AM +0800, Xu Yilun wrote:
> > On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
> > > On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
> > > > Ping?
> > >
> > > Sorry for the late reply, I was on vacation.
> > >
> > > > Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
> > >
> > > As discussed in the thread, this kAPI is not well considered but IIUC
> > > the concept of "importer mapping" is still valid. We need more
> > > investigation about all the needs - P2P, CC memory, private bus
> > > channel, and work out a formal API.
> > >
> > > However in last few months I'm focusing on high level TIO flow - TSM
> > > framework, IOMMUFD based bind/unbind, so not much progress here and is
> > > still using this temporary kAPI. But as long as "importer mapping" is
> > > alive, the dmabuf fd for KVM is still valid and we could enable TIO
> > > based on that.
> >
> > Oh I forgot to mention I moved the dmabuf creation from VFIO to IOMMUFD
> > recently, the IOCTL is against iommufd_device.
>
> I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> just about managing the translation control of the device.
I have a little difficulty understanding this. Is TSM bind PCI stuff? To
me it is. The host sends PCI TDISP messages via PCI DOE to put the device
into the TDISP LOCKED state, so that the device behaves differently from
before. Then why put it in IOMMUFD?
Or does "managing the translation control" mean IOMMUFD provides the TSM
bind/unbind uAPI and calls into the VFIO driver for the real TSM bind
implementation?
>
> > According to Jason's
> > opinion [1], TSM bind/unbind should be called against iommufd_device,
> > then I need to do the same for dmabuf. This is because Intel TDX
> > Connect enforces a specific operation sequence between TSM unbind & MMIO
> > unmap:
> >
> > 1. STOP TDI via TDISP message STOP_INTERFACE
> > 2. Private MMIO unmap from Secure EPT
> > 3. Trusted Device Context Table cleanup for the TDI
> > 4. TDI ownership reclaim and metadata free
>
> So your issue is you need to shoot down the dmabuf during vPCI device
> destruction?
I assume "vPCI device" refers to assigned device in both shared mode &
prvate mode. So no, I need to shoot down the dmabuf during TSM unbind,
a.k.a. when assigned device is converting from private to shared.
Then recover the dmabuf after TSM unbind. The device could still work
in VM in shared mode.
>
> VFIO also needs to shoot down the MMIO during things like FLR
>
> I don't think moving to iommufd really fixes it, it sounds like you
> need more coordination between the two parts??
Yes, when moving to iommufd, VFIO needs extra kAPIs to inform IOMMUFD
about the shootdown. But an FLR or MSE toggle also breaks the TSM bind
state, so as long as we put TSM bind in IOMMUFD the coordination is
needed anyway.
What I really want is one SW component to manage the MMIO dmabuf, secure
IOMMU & TSM bind/unbind. That makes it easier to coordinate these 3
operations, because they are interconnected according to the secure
firmware's requirements.
Otherwise, e.g. for TDX, when the device is TSM bound (IOMMUFD controls
bind) and VFIO wants to FLR, VFIO revokes the dmabuf first and things
explode.
The safe way is for one SW component to manage all this "pre-FLR" stuff,
let's say IOMMUFD: it first does TSM unbind and lets the platform TSM
driver decide the correct operation sequence (TDISP, dmabuf for private
MMIO mapping, secure DMA). After TSM unbind it's a shared device, and
IOMMUFD can revoke the dmabuf as needed without worry.
Maybe I could send a patchset to illustrate...
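For instance, a minimal sketch of that ordering (every helper name below
is hypothetical, invented only to illustrate the sequencing):
static void iommufd_vdevice_pre_flr(struct iommufd_vdevice *vdev)
{
	/*
	 * TSM unbind first: the platform TSM driver drives the
	 * firmware-mandated sequence (TDISP STOP_INTERFACE, private MMIO
	 * unmap via a dmabuf-revoke callback, trusted context table
	 * cleanup, TDI metadata free).
	 */
	tsm_unbind(vdev->tsm_tdi);		/* hypothetical */
	/* Now a shared device: revoking the dmabuf is safe. */
	iommufd_vdevice_dmabuf_revoke(vdev);	/* hypothetical */
}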
Thanks,
Yilun
>
> Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-10 3:47 ` Xu Yilun
@ 2025-05-12 9:30 ` Alexey Kardashevskiy
2025-05-12 14:06 ` Jason Gunthorpe
2025-05-14 3:20 ` Xu Yilun
0 siblings, 2 replies; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-12 9:30 UTC (permalink / raw)
To: Xu Yilun, Jason Gunthorpe
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 10/5/25 13:47, Xu Yilun wrote:
> On Fri, May 09, 2025 at 03:43:18PM -0300, Jason Gunthorpe wrote:
>> On Sat, May 10, 2025 at 12:28:48AM +0800, Xu Yilun wrote:
>>> On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
>>>> On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
>>>>> Ping?
>>>>
>>>> Sorry for the late reply, I was on vacation.
>>>>
>>>>> Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
>>>>
>>>> As discussed in the thread, this kAPI is not well considered but IIUC
>>>> the concept of "importer mapping" is still valid. We need more
>>>> investigation about all the needs - P2P, CC memory, private bus
>>>> channel, and work out a formal API.
>>>>
>>>> However in last few months I'm focusing on high level TIO flow - TSM
>>>> framework, IOMMUFD based bind/unbind, so not much progress here and is
>>>> still using this temporary kAPI. But as long as "importer mapping" is
>>>> alive, the dmabuf fd for KVM is still valid and we could enable TIO
>>>> based on that.
>>>
>>> Oh I forgot to mention I moved the dmabuf creation from VFIO to IOMMUFD
>>> recently, the IOCTL is against iommufd_device.
>>
>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
>> just about managing the translation control of the device.
>
> I have a little difficulty to understand. Is TSM bind PCI stuff? To me
> it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> TDISP LOCKED state, so that device behaves differently from before. Then
> why put it in IOMMUFD?
"TSM bind" sets up the CPU side of it, it binds a VM to a piece of IOMMU on the host CPU. The device does not know about the VM, it just enables/disables encryption by a request from the CPU (those start/stop interface commands). And IOMMUFD won't be doing DOE, the platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
We probably should notify VFIO about the state transition, but I do not know what VFIO would want to do in response.
> Or "managing the translation control" means IOMMUFD provides the TSM
> bind/unbind uAPI and call into VFIO driver for real TSM bind
> implementation?
>
>>
>>> According to Jason's
>>> opinion [1], TSM bind/unbind should be called against iommufd_device,
>>> then I need to do the same for dmabuf. This is because Intel TDX
>>> Connect enforces a specific operation sequence between TSM unbind & MMIO
>>> unmap:
>>>
>>> 1. STOP TDI via TDISP message STOP_INTERFACE
>>> 2. Private MMIO unmap from Secure EPT
>>> 3. Trusted Device Context Table cleanup for the TDI
>>> 4. TDI ownership reclaim and metadata free
>>
>> So your issue is you need to shoot down the dmabuf during vPCI device
>> destruction?
>
> I assume "vPCI device" refers to assigned device in both shared mode &
> private mode. So no, I need to shoot down the dmabuf during TSM unbind,
> a.k.a. when assigned device is converting from private to shared.
> Then recover the dmabuf after TSM unbind. The device could still work
> in VM in shared mode.
>
>>
>> VFIO also needs to shoot down the MMIO during things like FLR
>>
>> I don't think moving to iommufd really fixes it, it sounds like you
>> need more coordination between the two parts??
>
> Yes, when moving to iommufd, VFIO needs extra kAPIs to inform IOMMUFD
> about the shooting down. But FLR or MSE toggle also breaks TSM bind
> state. As long as we put TSM bind in IOMMUFD, anyway the coordination
> is needed.
>
> What I really want is, one SW component to manage MMIO dmabuf, secure
> iommu & TSM bind/unbind. So easier coordinate these 3 operations cause
> these ops are interconnected according to secure firmware's requirement.
This SW component is QEMU. It knows about FLRs and other config space things, and it can destroy all these IOMMUFD objects and talk to VFIO too. I've tried; so far it is looking easier to manage. Thanks,
> Otherwise e.g. for TDX, when device is TSM bound (IOMMUFD controls
> bind) and VFIO wants FLR, VFIO revokes dmabuf first then explode.
>
> Safe way is one SW component manages all these "pre-FLR" stuffs, let's say
> IOMMUFD, it firstly do TSM unbind, let the platform TSM driver decides
> the correct operation sequence (TDISP, dmabuf for private MMIO mapping,
> secure dma). After TSM unbind, it's a shared device and IOMMUFD have no
> worry to revoke dmabuf as needed.
>
> Maybe I could send a patchset to illustrate...
>
> Thanks,
> Yilun
>
>>
>> Jason
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-12 9:30 ` Alexey Kardashevskiy
@ 2025-05-12 14:06 ` Jason Gunthorpe
2025-05-13 10:03 ` Zhi Wang
2025-05-14 7:02 ` Xu Yilun
2025-05-14 3:20 ` Xu Yilun
1 sibling, 2 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-12 14:06 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
> > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> > > just about managing the translation control of the device.
> >
> > I have a little difficulty to understand. Is TSM bind PCI stuff? To me
> > it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> > TDISP LOCKED state, so that device behaves differently from before. Then
> > why put it in IOMMUFD?
>
>
> "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> IOMMU on the host CPU. The device does not know about the VM, it
> just enables/disables encryption by a request from the CPU (those
> start/stop interface commands). And IOMMUFD won't be doing DOE, the
> platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
>
> We probably should notify VFIO about the state transition but I do
> not know VFIO would want to do in response.
We have an awkward fit for what CCA people are doing to the various
Linux APIs. Looking somewhat maximally across all the arches a "bind"
for a CC vPCI device creation operation does:
- Setup the CPU page tables for the VM to have access to the MMIO
- Revoke hypervisor access to the MMIO
- Setup the vIOMMU to understand the vPCI device
- Take over control of some of the IOVA translation, at least for T=1,
and route it to the vIOMMU
- Register the vPCI with any attestation functions the VM might use
- Do some DOE stuff to manage/validate TDISP/etc
So we have interactions of things controlled by PCI, KVM, VFIO, and
iommufd all mushed together.
iommufd is the only area that already has a handle to all the required
objects:
- The physical PCI function
- The CC vIOMMU object
- The KVM FD
- The CC vPCI object
Which is why I have been thinking it is the right place to manage
this.
It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
stays in VFIO.
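As a rough sketch of those handles (an illustrative struct, not an
existing kernel type):
struct cc_vpci_handles {		/* illustrative only */
	struct pci_dev *pdev;		/* the physical PCI function */
	struct iommufd_viommu *viommu;	/* the CC vIOMMU object */
	struct kvm *kvm;		/* from the KVM FD */
	u64 virt_id;			/* the CC vPCI (guest-visible) ID */
};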
> > > So your issue is you need to shoot down the dmabuf during vPCI device
> > > destruction?
> >
> > I assume "vPCI device" refers to assigned device in both shared mode &
> > private mode. So no, I need to shoot down the dmabuf during TSM unbind,
> > a.k.a. when assigned device is converting from private to shared.
> > Then recover the dmabuf after TSM unbind. The device could still work
> > in VM in shared mode.
What are you trying to protect with this? Is there some intelism where
you can't have references to encrypted MMIO pages?
> > What I really want is, one SW component to manage MMIO dmabuf, secure
> > iommu & TSM bind/unbind. So easier coordinate these 3 operations cause
> > these ops are interconnected according to secure firmware's requirement.
>
> This SW component is QEMU. It knows about FLRs and other config
> space things, it can destroy all these IOMMUFD objects and talk to
> VFIO too, I've tried, so far it is looking easier to manage. Thanks,
Yes, qemu should be sequencing this. The kernel only needs to enforce
any rules required to keep the system from crashing.
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-12 14:06 ` Jason Gunthorpe
@ 2025-05-13 10:03 ` Zhi Wang
2025-05-14 9:47 ` Xu Yilun
2025-05-15 10:29 ` Alexey Kardashevskiy
2025-05-14 7:02 ` Xu Yilun
1 sibling, 2 replies; 134+ messages in thread
From: Zhi Wang @ 2025-05-13 10:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, Xu Yilun, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Mon, 12 May 2025 11:06:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>
> > > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
> > > > it is just about managing the translation control of the device.
> > >
> > > I have a little difficulty to understand. Is TSM bind PCI stuff?
> > > To me it is. Host sends PCI TDISP messages via PCI DOE to put the
> > > device in TDISP LOCKED state, so that device behaves differently
> > > from before. Then why put it in IOMMUFD?
> >
> >
> > "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> > IOMMU on the host CPU. The device does not know about the VM, it
> > just enables/disables encryption by a request from the CPU (those
> > start/stop interface commands). And IOMMUFD won't be doing DOE, the
> > platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
> >
> > We probably should notify VFIO about the state transition but I do
> > not know VFIO would want to do in response.
>
> We have an awkward fit for what CCA people are doing to the various
> Linux APIs. Looking somewhat maximally across all the arches a "bind"
> for a CC vPCI device creation operation does:
>
> - Setup the CPU page tables for the VM to have access to the MMIO
> - Revoke hypervisor access to the MMIO
> - Setup the vIOMMU to understand the vPCI device
> - Take over control of some of the IOVA translation, at least for
> > T=1, and route it to the vIOMMU
> > - Register the vPCI with any attestation functions the VM might use
> > - Do some DOE stuff to manage/validate TDISP/etc
>
> So we have interactions of things controlled by PCI, KVM, VFIO, and
> iommufd all mushed together.
>
> iommufd is the only area that already has a handle to all the required
> objects:
> - The physical PCI function
> - The CC vIOMMU object
> - The KVM FD
> - The CC vPCI object
>
> Which is why I have been thinking it is the right place to manage
> this.
>
> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> stays in VFIO.
>
> > > > So your issue is you need to shoot down the dmabuf during vPCI
> > > > device destruction?
> > >
> > > I assume "vPCI device" refers to assigned device in both shared
> > > mode & private mode. So no, I need to shoot down the dmabuf during
> > > TSM unbind, a.k.a. when assigned device is converting from
> > > private to shared. Then recover the dmabuf after TSM unbind. The
> > > device could still work in VM in shared mode.
>
> What are you trying to protect with this? Is there some intelism where
> you can't have references to encrypted MMIO pages?
>
I think it is a matter of design choice. The encrypted MMIO page is
related to the TDI context and the secure second-level translation table
(S-EPT), and the S-EPT is related to the confidential VM's context.
AMD and ARM have another level of HW control which, together
with a TSM-owned meta table, can simply mask out access to those
encrypted MMIO pages. Thus, the life cycle of the encrypted mappings in
the second-level translation table can be de-coupled from the TDI
unbind. They can be reaped harmlessly later by the hypervisor in another
path.
The Intel platform, by design, doesn't have that additional level of
HW control. Thus, the cleanup of the encrypted MMIO page mapping
in the S-EPT has to be coupled tightly with TDI context destruction in
the TDI unbind process.
If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
cross-module notification to KVM to do the cleanup in the S-EPT.
So shooting down the DMABUF object (encrypted MMIO page) means shooting
down the S-EPT mapping, and recovering the DMABUF object means
re-constructing the non-encrypted MMIO mapping in the EPT after the TDI
is unbound.
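As a sketch of that path, assuming the revoke is modeled on dma-buf move
notification (dma_buf_move_notify() is the existing dma-buf API for
dynamic exporters; the wrapper name is illustrative):
static void vpci_mmio_dmabuf_revoke(struct dma_buf *dmabuf)
{
	dma_resv_lock(dmabuf->resv, NULL);
	/* Dynamic importers (KVM here) must drop their mappings. */
	dma_buf_move_notify(dmabuf);
	dma_resv_unlock(dmabuf->resv);
}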
Z.
> > > What I really want is, one SW component to manage MMIO dmabuf,
> > > secure iommu & TSM bind/unbind. So easier coordinate these 3
> > > operations cause these ops are interconnected according to secure
> > > firmware's requirement.
> >
> > This SW component is QEMU. It knows about FLRs and other config
> > space things, it can destroy all these IOMMUFD objects and talk to
> > VFIO too, I've tried, so far it is looking easier to manage. Thanks,
>
> Yes, qemu should be sequencing this. The kernel only needs to enforce
> any rules required to keep the system from crashing.
>
> Jason
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-13 10:03 ` Zhi Wang
@ 2025-05-14 9:47 ` Xu Yilun
2025-05-14 20:05 ` Zhi Wang
2025-05-15 10:29 ` Alexey Kardashevskiy
1 sibling, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-14 9:47 UTC (permalink / raw)
To: Zhi Wang
Cc: Jason Gunthorpe, Alexey Kardashevskiy, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, christian.koenig,
pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
On Tue, May 13, 2025 at 01:03:15PM +0300, Zhi Wang wrote:
> On Mon, 12 May 2025 11:06:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
> >
> > > > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
> > > > > it is just about managing the translation control of the device.
> > > >
> > > > I have a little difficulty to understand. Is TSM bind PCI stuff?
> > > > To me it is. Host sends PCI TDISP messages via PCI DOE to put the
> > > > device in TDISP LOCKED state, so that device behaves differently
> > > > from before. Then why put it in IOMMUFD?
> > >
> > >
> > > "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> > > IOMMU on the host CPU. The device does not know about the VM, it
> > > just enables/disables encryption by a request from the CPU (those
> > > start/stop interface commands). And IOMMUFD won't be doing DOE, the
> > > platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
> > >
> > > We probably should notify VFIO about the state transition but I do
> > > not know VFIO would want to do in response.
> >
> > We have an awkward fit for what CCA people are doing to the various
> > Linux APIs. Looking somewhat maximally across all the arches a "bind"
> > for a CC vPCI device creation operation does:
> >
> > - Setup the CPU page tables for the VM to have access to the MMIO
> > - Revoke hypervisor access to the MMIO
> > - Setup the vIOMMU to understand the vPCI device
> > - Take over control of some of the IOVA translation, at least for
> > T=1, and route it to the vIOMMU
> > - Register the vPCI with any attestation functions the VM might use
> > - Do some DOE stuff to manage/validate TDISP/etc
> >
> > So we have interactions of things controlled by PCI, KVM, VFIO, and
> > iommufd all mushed together.
> >
> > iommufd is the only area that already has a handle to all the required
> > objects:
> > - The physical PCI function
> > - The CC vIOMMU object
> > - The KVM FD
> > - The CC vPCI object
> >
> > Which is why I have been thinking it is the right place to manage
> > this.
> >
> > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > stays in VFIO.
> >
> > > > > So your issue is you need to shoot down the dmabuf during vPCI
> > > > > device destruction?
> > > >
> > > > I assume "vPCI device" refers to assigned device in both shared
> > > > mode & private mode. So no, I need to shoot down the dmabuf during
> > > > TSM unbind, a.k.a. when assigned device is converting from
> > > > private to shared. Then recover the dmabuf after TSM unbind. The
> > > > device could still work in VM in shared mode.
> >
> > What are you trying to protect with this? Is there some intelism where
> > you can't have references to encrypted MMIO pages?
> >
>
> I think it is a matter of design choice. The encrypted MMIO page is
> related to the TDI context and secure second level translation table
> (S-EPT). and S-EPT is related to the confidential VM's context.
>
> AMD and ARM have another level of HW control, together
> with a TSM-owned meta table, can simply mask out the access to those
> encrypted MMIO pages. Thus, the life cycle of the encrypted mappings in
> the second level translation table can be de-coupled from the TDI
> unbound. They can be reaped un-harmfully later by hypervisor in another
> path.
>
> While on Intel platform, it doesn't have that additional level of
> HW control by design. Thus, the cleanup of encrypted MMIO page mapping
> in the S-EPT has to be coupled tightly with TDI context destruction in
> the TDI unbind process.
Thanks for the accurate explanation. Yes, in TDX, a reference/mapping
to the encrypted MMIO page means a CoCo-VM owns the MMIO page, so the TDX
firmware won't allow the CC vPCI device (which physically owns the MMIO
page) to be unbound/freed from a CoCo-VM while the VM still has the S-EPT
mapping.
AMD doesn't use the KVM page table to track CC ownership, so there is no
need to interact with KVM.
Thanks,
Yilun
>
> > If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
> cross-module notification to KVM to do cleanup in the S-EPT.
>
> So shooting down the DMABUF object (encrypted MMIO page) means shooting
> down the S-EPT mapping and recovering the DMABUF object means
> re-construct the non-encrypted MMIO mapping in the EPT after the TDI is
> unbound.
>
> Z.
>
> > > > What I really want is, one SW component to manage MMIO dmabuf,
> > > > secure iommu & TSM bind/unbind. So easier coordinate these 3
> > > > operations cause these ops are interconnected according to secure
> > > > firmware's requirement.
> > >
> > > This SW component is QEMU. It knows about FLRs and other config
> > > space things, it can destroy all these IOMMUFD objects and talk to
> > > VFIO too, I've tried, so far it is looking easier to manage. Thanks,
> >
> > Yes, qemu should be sequencing this. The kernel only needs to enforce
> > any rules required to keep the system from crashing.
> >
> > Jason
> >
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-14 9:47 ` Xu Yilun
@ 2025-05-14 20:05 ` Zhi Wang
2025-05-15 18:02 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Zhi Wang @ 2025-05-14 20:05 UTC (permalink / raw)
To: Xu Yilun
Cc: Jason Gunthorpe, Alexey Kardashevskiy, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, christian.koenig,
pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
On Wed, 14 May 2025 17:47:12 +0800
Xu Yilun <yilun.xu@linux.intel.com> wrote:
> On Tue, May 13, 2025 at 01:03:15PM +0300, Zhi Wang wrote:
> > On Mon, 12 May 2025 11:06:17 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
> > > wrote:
> > >
> > > > > > I'm surprised by this.. iommufd shouldn't be doing PCI
> > > > > > stuff, it is just about managing the translation control of
> > > > > > the device.
> > > > >
> > > > > I have a little difficulty to understand. Is TSM bind PCI
> > > > > stuff? To me it is. Host sends PCI TDISP messages via PCI DOE
> > > > > to put the device in TDISP LOCKED state, so that device
> > > > > behaves differently from before. Then why put it in IOMMUFD?
> > > >
> > > >
> > > > "TSM bind" sets up the CPU side of it, it binds a VM to a piece
> > > > of IOMMU on the host CPU. The device does not know about the
> > > > VM, it just enables/disables encryption by a request from the
> > > > CPU (those start/stop interface commands). And IOMMUFD won't be
> > > > doing DOE, the platform driver (such as AMD CCP) will. Nothing
> > > > to do for VFIO here.
> > > >
> > > > We probably should notify VFIO about the state transition but I
> > > > do not know VFIO would want to do in response.
> > >
> > > We have an awkward fit for what CCA people are doing to the
> > > various Linux APIs. Looking somewhat maximally across all the
> > > arches a "bind" for a CC vPCI device creation operation does:
> > >
> > > - Setup the CPU page tables for the VM to have access to the MMIO
> > > - Revoke hypervisor access to the MMIO
> > > - Setup the vIOMMU to understand the vPCI device
> > > - Take over control of some of the IOVA translation, at least for
> > > T=1, and route it to the vIOMMU
> > > - Register the vPCI with any attestation functions the VM might
> > > use
> > > - Do some DOE stuff to manage/validate TDISP/etc
> > >
> > > So we have interactions of things controlled by PCI, KVM, VFIO,
> > > and iommufd all mushed together.
> > >
> > > iommufd is the only area that already has a handle to all the
> > > required objects:
> > > - The physical PCI function
> > > - The CC vIOMMU object
> > > - The KVM FD
> > > - The CC vPCI object
> > >
> > > Which is why I have been thinking it is the right place to manage
> > > this.
> > >
> > > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > > stays in VFIO.
> > >
> > > > > > So your issue is you need to shoot down the dmabuf during
> > > > > > vPCI device destruction?
> > > > >
> > > > > I assume "vPCI device" refers to assigned device in both
> > > > > shared mode & private mode. So no, I need to shoot down the
> > > > > dmabuf during TSM unbind, a.k.a. when assigned device is
> > > > > converting from private to shared. Then recover the dmabuf
> > > > > after TSM unbind. The device could still work in VM in shared
> > > > > mode.
> > >
> > > What are you trying to protect with this? Is there some intelism
> > > where you can't have references to encrypted MMIO pages?
> > >
> >
> > I think it is a matter of design choice. The encrypted MMIO page is
> > related to the TDI context and secure second level translation table
> > (S-EPT). and S-EPT is related to the confidential VM's context.
> >
> > AMD and ARM have another level of HW control, together
> > with a TSM-owned meta table, can simply mask out the access to those
> > encrypted MMIO pages. Thus, the life cycle of the encrypted
> > mappings in the second level translation table can be de-coupled
> > from the TDI unbind. They can be reaped harmlessly later by
> > the hypervisor in another path.
> >
> > While on Intel platform, it doesn't have that additional level of
> > HW control by design. Thus, the cleanup of encrypted MMIO page
> > mapping in the S-EPT has to be coupled tightly with TDI context
> > destruction in the TDI unbind process.
>
> Thanks for the accurate explanation. Yes, in TDX, the
> references/mapping to the encrypted MMIO page means a CoCo-VM owns
> the MMIO page. So TDX firmware won't allow the CC vPCI device (which
> physically owns the MMIO page) unbind/freed from a CoCo-VM, while the
> VM still have the S-EPT mapping.
>
> AMD doesn't use KVM page table to track CC ownership, so no need to
> interact with KVM.
>
IMHO, I think it might be helpful if you could picture out the
minimum requirements (function/life cycle) for the current IOMMUFD TSM
bind architecture:
1. host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
the TVM-HOST call.
2. TDI acceptance is handled in guest_request() to accept the TDI after
the validation in the TVM,
and which parts/where need to be modified in the current architecture to
get there. Try to fold in vendor-specific knowledge as much as possible,
but still keep it modular in the TSM driver, and let's see how it looks.
Maybe some example TSM driver code to demonstrate, together with the
VFIO dma-buf patch.
If somewhere is extremely hacky in the TSM driver, let's see how it
can be lifted to the upper level, or the upper call can pass more
parameters down.
Z.
> Thanks,
> Yilun
>
> >
> > If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
> > cross-module notification to KVM to do cleanup in the S-EPT.
> >
> > So shooting down the DMABUF object (encrypted MMIO page) means
> > shooting down the S-EPT mapping and recovering the DMABUF object
> > means re-construct the non-encrypted MMIO mapping in the EPT after
> > the TDI is unbound.
> >
> > Z.
> >
> > > > > What I really want is, one SW component to manage MMIO dmabuf,
> > > > > secure iommu & TSM bind/unbind. So easier coordinate these 3
> > > > > operations cause these ops are interconnected according to
> > > > > secure firmware's requirement.
> > > >
> > > > This SW component is QEMU. It knows about FLRs and other config
> > > > space things, it can destroy all these IOMMUFD objects and talk
> > > > to VFIO too, I've tried, so far it is looking easier to manage.
> > > > Thanks,
> > >
> > > Yes, qemu should be sequencing this. The kernel only needs to
> > > enforce any rules required to keep the system from crashing.
> > >
> > > Jason
> > >
> >
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-14 20:05 ` Zhi Wang
@ 2025-05-15 18:02 ` Xu Yilun
2025-05-15 19:21 ` Jason Gunthorpe
2025-05-20 10:57 ` Alexey Kardashevskiy
0 siblings, 2 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-15 18:02 UTC (permalink / raw)
To: Zhi Wang
Cc: Jason Gunthorpe, Alexey Kardashevskiy, kvm, dri-devel,
linux-media, linaro-mm-sig, sumit.semwal, christian.koenig,
pbonzini, seanjc, alex.williamson, vivek.kasireddy,
dan.j.williams, yilun.xu, linux-coco, linux-kernel, lukas,
yan.y.zhao, daniel.vetter, leon, baolu.lu, zhenzhong.duan,
tao1.su
> IMHO, I think it might be helpful that you can picture out what are the
> minimum requirements (function/life cycle) to the current IOMMUFD TSM
> bind architecture:
>
> 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
> the TVM-HOST call.
> 2. TDI acceptance is handled in guest_request() to accept the TDI after
> the validation in the TVM)
I'll try my best to brainstorm and make a flow in ASCII.
(*) means new feature
Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
----- --------- ---- ---- ------- -------- ---
1. *Connect(IDE)
2. Init vdev
3. *create dmabuf
4. *export dmabuf
5. create memslot
6. *import dmabuf
7. setup shared DMA
8. create hwpt
9. attach hwpt
10. kvm run
11.enum shared dev
12.*Connect(Bind)
13. *GHCI Bind
14. *Bind
15. CC viommu alloc
16. vdevice alloc
16. *attach vdev
17. *setup CC viommu
18. *tsm_bind
19. *bind
20.*Attest
21. *GHCI get CC info
22. *get CC info
23. *vdev guest req
24. *guest req
25.*Accept
26. *GHCI accept MMIO/DMA
27. *accept MMIO/DMA
28. *vdev guest req
29. *guest req
30. *map private MMIO
31. *GHCI start tdi
32. *start tdi
33. *vdev guest req
34. *guest req
35.Workload...
36.*disconnect(Unbind)
37. *GHCI unbind
38. *Unbind
39. *detach vdev
40. *tsm_unbind
41. *TDX stop tdi
42. *TDX disable mmio cb
43. *cb dmabuf revoke
44. *unmap private MMIO
45. *TDX disable dma cb
46. *cb disable CC viommu
47. *TDX tdi free
48. *enable mmio
49. *cb dmabuf recover
50.workable shared dev
TSM unbind is a little verbose & specific to TDX Connect, but the SEV TSM
could ignore these callbacks and just implement an "unbind" tsm op, as
sketched below.
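For example (a sketch only; the struct and callback names are
hypothetical, mapped onto the chart's steps):
struct tsm_bind_ops {
	int (*bind)(struct tsm_tdi *tdi);
	void (*unbind)(struct tsm_tdi *tdi);
	/*
	 * Optional intermediate steps for the TDX Connect ordering;
	 * a SEV/CCA backend can leave these NULL:
	 */
	void (*stop_tdi)(struct tsm_tdi *tdi);
	void (*disable_mmio)(struct tsm_tdi *tdi);	/* cb dmabuf revoke */
	void (*disable_dma)(struct tsm_tdi *tdi);	/* cb disable CC viommu */
};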
Thanks,
Yilun
>
> and which part/where need to be modified in the current architecture to
> reach there. Try to fold vendor-specific knowledge as much as possible,
> but still keep them modular in the TSM driver and let's see how it looks
> like. Maybe some example TSM driver code to demonstrate together with
> VFIO dma-buf patch.
>
> If some where is extremely hacky in the TSM driver, let's see how they
> can be lift to the upper level or the upper call passes more parameters
> to them.
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 18:02 ` Xu Yilun
@ 2025-05-15 19:21 ` Jason Gunthorpe
2025-05-16 6:19 ` Xu Yilun
2025-05-20 10:57 ` Alexey Kardashevskiy
1 sibling, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-15 19:21 UTC (permalink / raw)
To: Xu Yilun
Cc: Zhi Wang, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Fri, May 16, 2025 at 02:02:29AM +0800, Xu Yilun wrote:
> > IMHO, I think it might be helpful that you can picture out what are the
> > minimum requirements (function/life cycle) to the current IOMMUFD TSM
> > bind architecture:
> >
> > 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
> > the TVM-HOST call.
> > 2. TDI acceptance is handled in guest_request() to accept the TDI after
> > the validation in the TVM)
>
> I'll try my best to brainstorm and make a flow in ASCII.
>
> (*) means new feature
>
>
> Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
> ----- --------- ---- ---- ------- -------- ---
> 1. *Connect(IDE)
> 2. Init vdev
open /dev/vfio/XX as a VFIO action
Then VFIO attaches to IOMMUFD as an iommufd action creating the idev
> 3. *create dmabuf
> 4. *export dmabuf
> 5. create memslot
> 6. *import dmabuf
> 7. setup shared DMA
> 8. create hwpt
> 9. attach hwpt
> 10. kvm run
> 11.enum shared dev
> 12.*Connect(Bind)
> 13. *GHCI Bind
> 14. *Bind
> 15 CC viommu alloc
> 16. vdevice alloc
viommu and vdevice creation happen before KVM run. The vPCI function
is visible to the guest from the very start, even though it is in T=0
mode. If a platform does not require any special CC steps prior to KVM
run then it just has a NOP for these functions.
What you have here is some new BIND operation against the already
existing vdevice as we discussed earlier.
> 16. *attach vdev
> 17. *setup CC viommu
> 18 *tsm_bind
> 19. *bind
> 20.*Attest
> 21. *GHCI get CC info
> 22. *get CC info
> 23. *vdev guest req
> 24. *guest req
> 25.*Accept
> 26. *GHCI accept MMIO/DMA
> 27. *accept MMIO/DMA
> 28. *vdev guest req
> 29. *guest req
> 30. *map private MMIO
> 31. *GHCI start tdi
> 32. *start tdi
> 33. *vdev guest req
> 34. *guest req
This seems reasonable; you want some generic RPC scheme to
carry messages from the VM to the TSM, tunneled through the iommufd
vdevice (because the vdevice has the vPCI ID, the KVM ID, the vIOMMU
ID and so on).
> 35.Workload...
> 36.*disconnect(Unbind)
> 37. *GHCI unbind
> 38. *Unbind
> 39. *detach vdev
unbind vdev. vdev remains until kvm is stopped.
> 40. *tsm_unbind
> 41. *TDX stop tdi
> 42. *TDX disable mmio cb
> 43. *cb dmabuf revoke
> 44. *unmap private MMIO
> 45. *TDX disable dma cb
> 46. *cb disable CC viommu
I don't know why you'd disable a viommu while the VM is running; it
doesn't make sense.
> 47. *TDX tdi free
> 48. *enable mmio
> 49. *cb dmabuf recover
> 50.workable shared dev
This is a nice chart; it would be good to see a comparable chart for
AMD and ARM.
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 19:21 ` Jason Gunthorpe
@ 2025-05-16 6:19 ` Xu Yilun
2025-05-16 12:49 ` Jason Gunthorpe
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-16 6:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Zhi Wang, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Thu, May 15, 2025 at 04:21:27PM -0300, Jason Gunthorpe wrote:
> On Fri, May 16, 2025 at 02:02:29AM +0800, Xu Yilun wrote:
> > > IMHO, I think it might be helpful that you can picture out what are the
> > > minimum requirements (function/life cycle) to the current IOMMUFD TSM
> > > bind architecture:
> > >
> > > 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
> > > the TVM-HOST call.
> > > 2. TDI acceptance is handled in guest_request() to accept the TDI after
> > > the validation in the TVM)
> >
> > I'll try my best to brainstorm and make a flow in ASCII.
> >
> > (*) means new feature
> >
> >
> > Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
> > ----- --------- ---- ---- ------- -------- ---
> > 1. *Connect(IDE)
> > 2. Init vdev
>
> open /dev/vfio/XX as a VFIO action
>
> Then VFIO attaches to IOMMUFD as an iommufd action creating the idev
>
> > 3. *create dmabuf
> > 4. *export dmabuf
> > 5. create memslot
> > 6. *import dmabuf
> > 7. setup shared DMA
> > 8. create hwpt
> > 9. attach hwpt
> > 10. kvm run
> > 11.enum shared dev
> > 12.*Connect(Bind)
> > 13. *GHCI Bind
> > 14. *Bind
> > 15 CC viommu alloc
> > 16. vdevice alloc
>
> viommu and vdevice creation happen before KVM run. The vPCI function
> is visible to the guest from the very start, even though it is in T=0
> mode. If a platform does not require any special CC steps prior to KVM
> run then it just has a NOP for these functions.
>
Fine.
> What you have here is some new BIND operation against the already
> existing vdevice as we discussed earlier.
>
> > 16. *attach vdev
> > 17. *setup CC viommu
> > 18 *tsm_bind
> > 19. *bind
> > 20.*Attest
> > 21. *GHCI get CC info
> > 22. *get CC info
> > 23. *vdev guest req
> > 24. *guest req
> > 25.*Accept
> > 26. *GHCI accept MMIO/DMA
> > 27. *accept MMIO/DMA
> > 28. *vdev guest req
> > 29. *guest req
> > 30. *map private MMIO
> > 31. *GHCI start tdi
> > 32. *start tdi
> > 33. *vdev guest req
> > 34. *guest req
>
> This seems reasonable you want to have some generic RPC scheme to
> carry messages fro mthe VM to the TSM tunneled through the iommufd
> vdevice (because the vdevice has the vPCI ID, the KVM ID, the VIOMMU
> id and so on)
>
> > 35.Workload...
> > 36.*disconnect(Unbind)
> > 37. *GHCI unbind
> > 38. *Unbind
> > 39. *detach vdev
>
> unbind vdev. vdev remains until kvm is stopped.
>
> > 40. *tsm_unbind
> > 41. *TDX stop tdi
> > 42. *TDX disable mmio cb
> > 43. *cb dmabuf revoke
> > 44. *unmap private MMIO
> > 45. *TDX disable dma cb
> > 46. *cb disable CC viommu
>
> I don't know why you'd disable a viommu while the VM is running,
> doesn't make sense.
Here it means removing the CC setup for the viommu; the shared setup is
still kept.
It is still because of the TDX enforcement on Unbind :(
1. STOP TDI via TDISP message STOP_INTERFACE
2. Private MMIO unmap from Secure EPT
3. Trusted Device Context Table cleanup for the TDI
4. TDI ownership reclaim and metadata free
It is doing Step 3 so that the TDI can finally be removed.
Please also note that I do the CC viommu setup on "Bind".
Thanks,
Yilun
>
> > 47. *TDX tdi free
> > 48. *enable mmio
> > 49. *cb dmabuf recover
> > 50.workable shared dev
>
> This is a nice chart, it would be good to see a comparable chart for
> AMD and ARM
>
> Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-16 6:19 ` Xu Yilun
@ 2025-05-16 12:49 ` Jason Gunthorpe
2025-05-17 2:33 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-16 12:49 UTC (permalink / raw)
To: Xu Yilun
Cc: Zhi Wang, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Fri, May 16, 2025 at 02:19:45PM +0800, Xu Yilun wrote:
> > I don't know why you'd disable a viommu while the VM is running,
> > doesn't make sense.
>
> Here it means remove the CC setup for viommu, shared setup is still
> kept.
That might make sense for the vPCI function, but not the vIOMMU. A
secure vIOMMU needs to be running at all times while the guest is
running. Perhaps it has no devices it can be used with, but its
functionality has to be there because a driver in the VM will be
connected to it.
At most "bind" should only tell the already existing secure vIOMMU
that it is allowed to translate for a specific vPCI function.
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-16 12:49 ` Jason Gunthorpe
@ 2025-05-17 2:33 ` Xu Yilun
0 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-17 2:33 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Zhi Wang, Alexey Kardashevskiy, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Fri, May 16, 2025 at 09:49:53AM -0300, Jason Gunthorpe wrote:
> On Fri, May 16, 2025 at 02:19:45PM +0800, Xu Yilun wrote:
> > > I don't know why you'd disable a viommu while the VM is running,
> > > doesn't make sense.
> >
> > Here it means remove the CC setup for viommu, shared setup is still
> > kept.
>
> That might make sense for the vPCI function, but not the vIOMMU. A
> secure vIOMMU needs to be running at all times while the guest is
> running. Perhaps it has no devices it can be used with, but its
> functionality has to be there because a driver in the VM will be
> connected to it.
>
> At most "bind" should only tell the already existing secure vIOMMU
> that it is allowed to translate for a specific vPCI function.
So I think something like:
struct iommufd_vdevice_ops {
	int (*setup_trusted_dma)(struct iommufd_vdevice *vdev);	/* for Bind */
	void (*remove_trusted_dma)(struct iommufd_vdevice *vdev);	/* for Unbind */
};
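That would keep the vendor-specific work behind two symmetric hooks:
setup_trusted_dma() invoked on "Bind" (the "setup CC viommu" step in the
chart earlier in the thread) and remove_trusted_dma() on "Unbind" (the
"cb disable CC viommu" step), with backends that need nothing there
simply providing no-ops.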
Thanks,
Yilun
>
> Jason
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 18:02 ` Xu Yilun
2025-05-15 19:21 ` Jason Gunthorpe
@ 2025-05-20 10:57 ` Alexey Kardashevskiy
2025-05-24 3:33 ` Xu Yilun
1 sibling, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-20 10:57 UTC (permalink / raw)
To: Xu Yilun, Zhi Wang
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 16/5/25 04:02, Xu Yilun wrote:
>> IMHO, I think it might be helpful that you can picture out what are the
>> minimum requirements (function/life cycle) to the current IOMMUFD TSM
>> bind architecture:
>>
>> 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
>> the TVM-HOST call.
>> 2. TDI acceptance is handled in guest_request() to accept the TDI after
>> the validation in the TVM)
>
> I'll try my best to brainstorm and make a flow in ASCII.
>
> (*) means new feature
>
>
> Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
> ----- --------- ---- ---- ------- -------- ---
> 1. *Connect(IDE)
> 2. Init vdev
> 3. *create dmabuf
> 4. *export dmabuf
> 5. create memslot
> 6. *import dmabuf
> 7. setup shared DMA
> 8. create hwpt
> 9. attach hwpt
> 10. kvm run
> 11.enum shared dev
> 12.*Connect(Bind)
> 13. *GHCI Bind
> 14. *Bind
> 15 CC viommu alloc
> 16. vdevice alloc
> 16. *attach vdev
This "attach vdev" - we are still deciding if it goes to IOMMUFD or VFIO, right?
> 17. *setup CC viommu
> 18 *tsm_bind
> 19. *bind
> 20.*Attest
> 21. *GHCI get CC info
> 22. *get CC info
> 23. *vdev guest req
> 24. *guest req
> 25.*Accept
> 26. *GHCI accept MMIO/DMA
> 27. *accept MMIO/DMA
> 28. *vdev guest req
> 29. *guest req
> 30. *map private MMIO
> 31. *GHCI start tdi
> 32. *start tdi
> 33. *vdev guest req
> 34. *guest req
I am not sure I follow the layout here. "start tdi" and "accept MMIO/DMA" are under "QEMU" but QEMU cannot do anything by itself and has to call VFIO or some other driver...
> 35.Workload...
> 36.*disconnect(Unbind)
Is this a case of PCI hot-unplug? Or just killing QEMU/shutting down the VM? Or no longer trusting the device and switching it to untrusted mode, to work with SWIOTLB or DiscardManager?
> 37. *GHCI unbind
> 38. *Unbind
> 39. *detach vdev
> 40. *tsm_unbind
> 41. *TDX stop tdi
> 42. *TDX disable mmio cb
> 43. *cb dmabuf revoke
... like VFIO and the host TSM - "TDX stop tdi" and "cb dmabuf revoke" are not under QEMU.
> 44. *unmap private MMIO
> 45. *TDX disable dma cb
> 46. *cb disable CC viommu
> 47. *TDX tdi free
> 48. *enable mmio
> 49. *cb dmabuf recover
What is the difference between "cb dmabuf revoke" and "cb dmabuf recover"?
> 50.workable shared dev
>
> > TSM unbind is a little verbose & specific to TDX Connect, but SEV TSM could
> > ignore these callbacks. Just implement an "unbind" tsm op.
Well, something needs to clear the RMP entries; that can be done in the TDI unbind or whenever you do it.
And the chart applies to AMD too, more or less. Thanks,
> Thanks,
> Yilun
>
>>
>> and which part/where need to be modified in the current architecture to
>> reach there. Try to fold vendor-specific knowledge as much as possible,
>> but still keep them modular in the TSM driver and let's see how it looks
>> like. Maybe some example TSM driver code to demonstrate together with
>> VFIO dma-buf patch.
>>
>> If some where is extremely hacky in the TSM driver, let's see how they
>> can be lift to the upper level or the upper call passes more parameters
>> to them.
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-20 10:57 ` Alexey Kardashevskiy
@ 2025-05-24 3:33 ` Xu Yilun
0 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-24 3:33 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Zhi Wang, Jason Gunthorpe, kvm, dri-devel, linux-media,
linaro-mm-sig, sumit.semwal, christian.koenig, pbonzini, seanjc,
alex.williamson, vivek.kasireddy, dan.j.williams, yilun.xu,
linux-coco, linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
baolu.lu, zhenzhong.duan, tao1.su
On Tue, May 20, 2025 at 08:57:42PM +1000, Alexey Kardashevskiy wrote:
>
>
> On 16/5/25 04:02, Xu Yilun wrote:
> > > IMHO, I think it might be helpful that you can picture out what are the
> > > minimum requirements (function/life cycle) to the current IOMMUFD TSM
> > > bind architecture:
> > >
> > > 1.host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
> > > the TVM-HOST call.
> > > 2. TDI acceptance is handled in guest_request() to accept the TDI after
> > > the validation in the TVM)
> >
> > I'll try my best to brainstorm and make a flow in ASCII.
> >
> > (*) means new feature
> >
> >
> > Guest Guest TSM QEMU VFIO IOMMUFD host TSM KVM
> > ----- --------- ---- ---- ------- -------- ---
> > 1. *Connect(IDE)
> > 2. Init vdev
> > 3. *create dmabuf
> > 4. *export dmabuf
> > 5. create memslot
> > 6. *import dmabuf
> > 7. setup shared DMA
> > 8. create hwpt
> > 9. attach hwpt
> > 10. kvm run
> > 11.enum shared dev
> > 12.*Connect(Bind)
> > 13. *GHCI Bind
> > 14. *Bind
> > 15 CC viommu alloc
> > 16. vdevice alloc
> > 16. *attach vdev
>
>
> This "attach vdev" - we are still deciding if it goes to IOMMUFD or VFIO, right?
This should be "tsm bind". Seems Jason's suggestion is place the IOCTL
against VFIO, then VFIO reach into IOMMUFD to do the real
pci_tsm_bind().
https://lore.kernel.org/all/20250515175658.GR382960@nvidia.com/
>
>
> > 17. *setup CC viommu
> > 18 *tsm_bind
> > 19. *bind
> > 20.*Attest
> > 21. *GHCI get CC info
> > 22. *get CC info
> > 23. *vdev guest req
> > 24. *guest req
> > 25.*Accept
> > 26. *GHCI accept MMIO/DMA
> > 27. *accept MMIO/DMA
> > 28. *vdev guest req
> > 29. *guest req
> > 30. *map private MMIO
> > 31. *GHCI start tdi
> > 32. *start tdi
> > 33. *vdev guest req
> > 34. *guest req
>
>
> I am not sure I follow the layout here. "start tdi" and "accept MMIO/DMA" are under "QEMU" but QEMU cannot do anything by itself and has to call VFIO or some other driver...
>
Yes. Call IOCTL(iommufd, IOMMUFD_VDEVICE_GUEST_REQUEST, vdevice_id),
roughly as sketched below.
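A possible shape for that payload, purely as a sketch (the ioctl name
comes from this discussion; the struct layout here is invented):
struct iommu_vdevice_guest_request {	/* hypothetical uAPI */
	__u32 size;		/* sizeof(self), iommufd uAPI convention */
	__u32 vdevice_id;	/* vdevice object holding vPCI/KVM/vIOMMU refs */
	__aligned_u64 req_uptr;	/* guest TSM request blob */
	__u32 req_len;
	__u32 resp_len;		/* in: buffer size, out: bytes written */
	__aligned_u64 resp_uptr;
};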
> > 35.Workload...
> > 36.*disconnect(Unbind)
>
> Is this a case of PCI hotunplug? Or just killing QEMU/shutting down the VM? Or stopping trusting the device and switching it to untrusted mode, to work with SWIOTLB or DiscardManager?
>
Switching to untrusted mode. But I think hot-unplug would eventually
trigger the same host-side behavior, only without needing the guest to
"echo 0 > connect".
> > 37. *GHCI unbind
> > 38. *Unbind
> > 39. *detach vdev
> > 40. *tsm_unbind
> > 41. *TDX stop tdi
> > 42. *TDX disable mmio cb
> > 43. *cb dmabuf revoke
>
>
> ... like VFIO and hostTSM - "TDX stop tdi" and "cb dmabuf revoke" are not under QEMU.
Correct. These are TDX Module specific requirements; we don't want them
to make the general APIs unnecessarily verbose.
>
>
> > 44. *unmap private MMIO
> > 45. *TDX disable dma cb
> > 46. *cb disable CC viommu
> > 47. *TDX tdi free
> > 48. *enable mmio
> > 49. *cb dmabuf recover
>
>
> What is the difference between "cb dmabuf revoke" and "cb dmabuf recover"?
Revoke revokes the private S-EPT mapping; recover means KVM can then do
the shared MMIO mapping in the EPT, as sketched below.
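The importer side of that could look like the following sketch
(move_notify is the existing dma-buf dynamic-importer hook; the KVM
wiring is hypothetical, and kvm_zap_gfn_range() is a KVM-internal x86
helper named here only for illustration):
struct kvm_dmabuf_range {		/* hypothetical importer_priv */
	struct kvm *kvm;
	gfn_t base_gfn;
	unsigned long npages;
};
static void kvm_dmabuf_move_notify(struct dma_buf_attachment *attach)
{
	struct kvm_dmabuf_range *r = attach->importer_priv;
	/*
	 * Revoke: zap the private S-EPT range. After "recover" the guest
	 * refaults and KVM installs shared EPT mappings instead.
	 */
	kvm_zap_gfn_range(r->kvm, r->base_gfn, r->base_gfn + r->npages);
}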
Thanks,
Yilun
>
>
> > 50.workable shared dev
> >
> > TSM unbind is a little verbose & specific to TDX Connect, but SEV TSM could
> > ignore these callbacks. Just implement an "unbind" tsm op.
>
>
> Well, something needs to clear the RMP entries; that can be done in the TDI unbind or whenever you do it.
>
> And the chart applies for AMD too, more or less. Thanks,
>
>
> > Thanks,
> > Yilun
> >
> > >
> > > and which part/where need to be modified in the current architecture to
> > > reach there. Try to fold vendor-specific knowledge as much as possible,
> > > but still keep them modular in the TSM driver and let's see how it looks
> > > like. Maybe some example TSM driver code to demonstrate together with
> > > VFIO dma-buf patch.
> > >
> > > If some where is extremely hacky in the TSM driver, let's see how they
> > > can be lift to the upper level or the upper call passes more parameters
> > > to them.
>
>
>
> --
> Alexey
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-13 10:03 ` Zhi Wang
2025-05-14 9:47 ` Xu Yilun
@ 2025-05-15 10:29 ` Alexey Kardashevskiy
2025-05-15 16:44 ` Zhi Wang
1 sibling, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-15 10:29 UTC (permalink / raw)
To: Zhi Wang, Jason Gunthorpe
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 13/5/25 20:03, Zhi Wang wrote:
> On Mon, 12 May 2025 11:06:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>>
>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
>>>>> it is just about managing the translation control of the device.
>>>>
>>>> I have a little difficulty to understand. Is TSM bind PCI stuff?
>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put the
>>>> device in TDISP LOCKED state, so that device behaves differently
>>>> from before. Then why put it in IOMMUFD?
>>>
>>>
>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
>>> IOMMU on the host CPU. The device does not know about the VM, it
>>> just enables/disables encryption by a request from the CPU (those
>>> start/stop interface commands). And IOMMUFD won't be doing DOE, the
>>> platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
>>>
>>> We probably should notify VFIO about the state transition but I do
>>> not know VFIO would want to do in response.
>>
>> We have an awkward fit for what CCA people are doing to the various
>> Linux APIs. Looking somewhat maximally across all the arches a "bind"
>> for a CC vPCI device creation operation does:
>>
>> - Setup the CPU page tables for the VM to have access to the MMIO
>> - Revoke hypervisor access to the MMIO
>> - Setup the vIOMMU to understand the vPCI device
>> - Take over control of some of the IOVA translation, at least for
>> T=1, and route it to the vIOMMU
>> - Register the vPCI with any attestation functions the VM might use
>> - Do some DOE stuff to manage/validate TDISP/etc
>>
>> So we have interactions of things controlled by PCI, KVM, VFIO, and
>> iommufd all mushed together.
>>
>> iommufd is the only area that already has a handle to all the required
>> objects:
>> - The physical PCI function
>> - The CC vIOMMU object
>> - The KVM FD
>> - The CC vPCI object
>>
>> Which is why I have been thinking it is the right place to manage
>> this.
>>
>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
>> stays in VFIO.
>>
>>>>> So your issue is you need to shoot down the dmabuf during vPCI
>>>>> device destruction?
>>>>
>>>> I assume "vPCI device" refers to assigned device in both shared
>>>> mode & private mode. So no, I need to shoot down the dmabuf during
>>>> TSM unbind, a.k.a. when assigned device is converting from
>>>> private to shared. Then recover the dmabuf after TSM unbind. The
>>>> device could still work in VM in shared mode.
>>
>> What are you trying to protect with this? Is there some intelism where
>> you can't have references to encrypted MMIO pages?
>>
>
> I think it is a matter of design choice. The encrypted MMIO page is
> related to the TDI context and secure second level translation table
> (S-EPT). and S-EPT is related to the confidential VM's context.
>
> AMD and ARM have another level of HW control, together
> with a TSM-owned meta table, can simply mask out the access to those
> encrypted MMIO pages. Thus, the life cycle of the encrypted mappings in
> the second level translation table can be de-coupled from the TDI
> unbind. They can be reaped harmlessly later by the hypervisor in another
> path.
>
> While on Intel platform, it doesn't have that additional level of
> HW control by design. Thus, the cleanup of encrypted MMIO page mapping
> in the S-EPT has to be coupled tightly with TDI context destruction in
> the TDI unbind process.
>
> If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
> cross-module notification to KVM to do cleanup in the S-EPT.
QEMU should know about this unbind and can tell KVM about it too. No cross-module notification is needed; it is not a hot path.
> So shooting down the DMABUF object (encrypted MMIO page) means shooting
> down the S-EPT mapping and recovering the DMABUF object means
> re-construct the non-encrypted MMIO mapping in the EPT after the TDI is
> unbound.
It is definitely QEMU's job to re-mmap the MMIO to userspace (as it does for non-trusted devices today) so that a nested page fault can later fill the nested PTE. Thanks,
>
> Z.
>
>>>> What I really want is, one SW component to manage MMIO dmabuf,
>>>> secure iommu & TSM bind/unbind. So easier coordinate these 3
>>>> operations cause these ops are interconnected according to secure
>>>> firmware's requirement.
>>>
>>> This SW component is QEMU. It knows about FLRs and other config
>>> space things, it can destroy all these IOMMUFD objects and talk to
>>> VFIO too, I've tried, so far it is looking easier to manage. Thanks,
>>
>> Yes, qemu should be sequencing this. The kernel only needs to enforce
>> any rules required to keep the system from crashing.
>>
>> Jason
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 10:29 ` Alexey Kardashevskiy
@ 2025-05-15 16:44 ` Zhi Wang
2025-05-15 16:53 ` Zhi Wang
0 siblings, 1 reply; 134+ messages in thread
From: Zhi Wang @ 2025-05-15 16:44 UTC (permalink / raw)
To: Alexey Kardashevskiy, Jason Gunthorpe
Cc: Xu Yilun, kvm@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
sumit.semwal@linaro.org, christian.koenig@amd.com,
pbonzini@redhat.com, seanjc@google.com,
alex.williamson@redhat.com, vivek.kasireddy@intel.com,
dan.j.williams@intel.com, yilun.xu@intel.com,
linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
lukas@wunner.de, yan.y.zhao@intel.com, daniel.vetter@ffwll.ch,
leon@kernel.org, baolu.lu@linux.intel.com,
zhenzhong.duan@intel.com, tao1.su@intel.com
On 15.5.2025 13.29, Alexey Kardashevskiy wrote:
>
>
> On 13/5/25 20:03, Zhi Wang wrote:
>> On Mon, 12 May 2025 11:06:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>>>
>>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
>>>>>> it is just about managing the translation control of the device.
>>>>>
>>>>> I have a little difficulty to understand. Is TSM bind PCI stuff?
>>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put the
>>>>> device in TDISP LOCKED state, so that device behaves differently
>>>>> from before. Then why put it in IOMMUFD?
>>>>
>>>>
>>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
>>>> IOMMU on the host CPU. The device does not know about the VM, it
>>>> just enables/disables encryption by a request from the CPU (those
>>>> start/stop interface commands). And IOMMUFD won't be doing DOE, the
>>>> platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
>>>>
>>>> We probably should notify VFIO about the state transition but I do
>>>> not know VFIO would want to do in response.
>>>
>>> We have an awkward fit for what CCA people are doing to the various
>>> Linux APIs. Looking somewhat maximally across all the arches a "bind"
>>> for a CC vPCI device creation operation does:
>>>
>>> - Setup the CPU page tables for the VM to have access to the MMIO
>>> - Revoke hypervisor access to the MMIO
>>> - Setup the vIOMMU to understand the vPCI device
>>> - Take over control of some of the IOVA translation, at least for
>>> T=1, and route to the the vIOMMU
>>> - Register the vPCI with any attestation functions the VM might use
>>> - Do some DOE stuff to manage/validate TDSIP/etc
>>>
>>> So we have interactions of things controlled by PCI, KVM, VFIO, and
>>> iommufd all mushed together.
>>>
>>> iommufd is the only area that already has a handle to all the required
>>> objects:
>>> - The physical PCI function
>>> - The CC vIOMMU object
>>> - The KVM FD
>>> - The CC vPCI object
>>>
>>> Which is why I have been thinking it is the right place to manage
>>> this.
>>>
>>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
>>> stays in VFIO.
>>>
>>>>>> So your issue is you need to shoot down the dmabuf during vPCI
>>>>>> device destruction?
>>>>>
>>>>> I assume "vPCI device" refers to assigned device in both shared
>>>>> mode & prvate mode. So no, I need to shoot down the dmabuf during
>>>>> TSM unbind, a.k.a. when assigned device is converting from
>>>>> private to shared. Then recover the dmabuf after TSM unbind. The
>>>>> device could still work in VM in shared mode.
>>>
>>> What are you trying to protect with this? Is there some intelism where
>>> you can't have references to encrypted MMIO pages?
>>>
>>
>> I think it is a matter of design choice. The encrypted MMIO page is
>> related to the TDI context and secure second level translation table
>> (S-EPT). and S-EPT is related to the confidential VM's context.
>>
>> AMD and ARM have another level of HW control, together
>> with a TSM-owned meta table, can simply mask out the access to those
>> encrypted MMIO pages. Thus, the life cycle of the encrypted mappings in
>> the second level translation table can be de-coupled from the TDI
>> unbound. They can be reaped un-harmfully later by hypervisor in another
>> path.
>>
>> While on Intel platform, it doesn't have that additional level of
>> HW control by design. Thus, the cleanup of encrypted MMIO page mapping
>> in the S-EPT has to be coupled tightly with TDI context destruction in
>> the TDI unbind process.
>>
>> If the TDI unbind is triggered in VFIO/IOMMUFD, there has be a
>> cross-module notification to KVM to do cleanup in the S-EPT.
>
> QEMU should know about this unbind and can tell KVM about it too. No
> cross module notification needed, it is not a hot path.
>
Yes. QEMU knows almost everything important, it can do the required flow
and the kernel can enforce the requirements. There shouldn't be a
problem at runtime.
But if QEMU crashes, what is left are only the fd closing paths and the
objects those fds represent in the kernel. The modules those fds belong
to need to resolve the dependencies of tearing down objects without the
help of QEMU.
At that point there will be private MMIO dmabuf fds, VFIO fds, the IOMMU
device fd and KVM fds. Who should trigger the TDI unbind then?
I think it should be triggered in the vdevice teardown path, in the
IOMMUFD fd closing path, as that is where the bind is initiated.
iommufd vdevice tear down (iommu fd closing path)
----> tsm_tdi_unbind
      ----> intel_tsm_tdi_unbind
            ...
            ----> private MMIO un-mapping in KVM
                  ----> cleanup private MMIO mapping in S-EPT and others
            ----> signal MMIO dmabuf can be safely removed.
                  ^ TVM teardown path (dmabuf uninstall path) checks
                    this state and waits before it can decrease the
                    dmabuf fd refcount
            ...
----> KVM TVM fd put
----> continue iommufd vdevice teardown.
Also, I think we need:
iommufd vdevice TSM bind
----> tsm_tdi_bind
      ----> intel_tsm_tdi_bind
            ...
            ----> KVM TVM fd get
...
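A minimal C sketch of the bind/unbind pairing above (tsm_tdi_bind/unbind
come from Alexey's TSM RFC with simplified signatures; the vdevice
fields and glue here are assumptions for illustration, not patchset
code):

  static int iommufd_vdevice_tsm_bind(struct iommufd_vdevice *vdev,
				      struct kvm *kvm)
  {
	int rc = tsm_tdi_bind(vdev->tdi);	/* signature simplified */

	if (rc)
		return rc;
	kvm_get_kvm(kvm);	/* pin the TVM until unbind completes */
	vdev->kvm = kvm;
	return 0;
  }

  static void iommufd_vdevice_tsm_unbind(struct iommufd_vdevice *vdev)
  {
	tsm_tdi_unbind(vdev->tdi);	/* S-EPT private MMIO cleaned up */
	/* now the MMIO dmabuf can be safely torn down */
	kvm_put_kvm(vdev->kvm);		/* may be the final put */
  }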
Z.
>
>> So shooting down the DMABUF object (encrypted MMIO page) means shooting
>> down the S-EPT mapping and recovering the DMABUF object means
>> re-construct the non-encrypted MMIO mapping in the EPT after the TDI is
>> unbound.
>
> This is definitely QEMU's job to re-mmap MMIO to the userspace (as it
> does for non-trusted devices today) so later on nested page fault could
> fill the nested PTE. Thanks,
>
>
>>
>> Z.
>>
>>>>> What I really want is, one SW component to manage MMIO dmabuf,
>>>>> secure iommu & TSM bind/unbind. So easier coordinate these 3
>>>>> operations cause these ops are interconnected according to secure
>>>>> firmware's requirement.
>>>>
>>>> This SW component is QEMU. It knows about FLRs and other config
>>>> space things, it can destroy all these IOMMUFD objects and talk to
>>>> VFIO too, I've tried, so far it is looking easier to manage. Thanks,
>>>
>>> Yes, qemu should be sequencing this. The kernel only needs to enforce
>>> any rules required to keep the system from crashing.
>>>
>>> Jason
>>>
>>
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 16:44 ` Zhi Wang
@ 2025-05-15 16:53 ` Zhi Wang
2025-05-21 10:41 ` Alexey Kardashevskiy
0 siblings, 1 reply; 134+ messages in thread
From: Zhi Wang @ 2025-05-15 16:53 UTC (permalink / raw)
To: Alexey Kardashevskiy, Jason Gunthorpe
Cc: Xu Yilun, kvm@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
sumit.semwal@linaro.org, christian.koenig@amd.com,
pbonzini@redhat.com, seanjc@google.com,
alex.williamson@redhat.com, vivek.kasireddy@intel.com,
dan.j.williams@intel.com, yilun.xu@intel.com,
linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
lukas@wunner.de, yan.y.zhao@intel.com, daniel.vetter@ffwll.ch,
leon@kernel.org, baolu.lu@linux.intel.com,
zhenzhong.duan@intel.com, tao1.su@intel.com
On Thu, 15 May 2025 16:44:47 +0000
Zhi Wang <zhiw@nvidia.com> wrote:
> On 15.5.2025 13.29, Alexey Kardashevskiy wrote:
> >
> >
> > On 13/5/25 20:03, Zhi Wang wrote:
> >> On Mon, 12 May 2025 11:06:17 -0300
> >> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>
> >>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
> >>> wrote:
> >>>
> >>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
> >>>>>> it is just about managing the translation control of the
> >>>>>> device.
> >>>>>
> >>>>> I have a little difficulty to understand. Is TSM bind PCI stuff?
> >>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put
> >>>>> the device in TDISP LOCKED state, so that device behaves
> >>>>> differently from before. Then why put it in IOMMUFD?
> >>>>
> >>>>
> >>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece
> >>>> of IOMMU on the host CPU. The device does not know about the VM,
> >>>> it just enables/disables encryption by a request from the CPU
> >>>> (those start/stop interface commands). And IOMMUFD won't be
> >>>> doing DOE, the platform driver (such as AMD CCP) will. Nothing
> >>>> to do for VFIO here.
> >>>>
> >>>> We probably should notify VFIO about the state transition but I
> >>>> do not know VFIO would want to do in response.
> >>>
> >>> We have an awkward fit for what CCA people are doing to the
> >>> various Linux APIs. Looking somewhat maximally across all the
> >>> arches a "bind" for a CC vPCI device creation operation does:
> >>>
> >>> - Setup the CPU page tables for the VM to have access to the
> >>> MMIO
> >>> - Revoke hypervisor access to the MMIO
> >>> - Setup the vIOMMU to understand the vPCI device
> >>> - Take over control of some of the IOVA translation, at least
> >>> for T=1, and route to the the vIOMMU
> >>> - Register the vPCI with any attestation functions the VM might
> >>> use
> >>> - Do some DOE stuff to manage/validate TDSIP/etc
> >>>
> >>> So we have interactions of things controlled by PCI, KVM, VFIO,
> >>> and iommufd all mushed together.
> >>>
> >>> iommufd is the only area that already has a handle to all the
> >>> required objects:
> >>> - The physical PCI function
> >>> - The CC vIOMMU object
> >>> - The KVM FD
> >>> - The CC vPCI object
> >>>
> >>> Which is why I have been thinking it is the right place to manage
> >>> this.
> >>>
> >>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> >>> stays in VFIO.
> >>>
> >>>>>> So your issue is you need to shoot down the dmabuf during vPCI
> >>>>>> device destruction?
> >>>>>
> >>>>> I assume "vPCI device" refers to assigned device in both shared
> >>>>> mode & prvate mode. So no, I need to shoot down the dmabuf
> >>>>> during TSM unbind, a.k.a. when assigned device is converting
> >>>>> from private to shared. Then recover the dmabuf after TSM
> >>>>> unbind. The device could still work in VM in shared mode.
> >>>
> >>> What are you trying to protect with this? Is there some intelism
> >>> where you can't have references to encrypted MMIO pages?
> >>>
> >>
> >> I think it is a matter of design choice. The encrypted MMIO page is
> >> related to the TDI context and secure second level translation
> >> table (S-EPT). and S-EPT is related to the confidential VM's
> >> context.
> >>
> >> AMD and ARM have another level of HW control, together
> >> with a TSM-owned meta table, can simply mask out the access to
> >> those encrypted MMIO pages. Thus, the life cycle of the encrypted
> >> mappings in the second level translation table can be de-coupled
> >> from the TDI unbound. They can be reaped un-harmfully later by
> >> hypervisor in another path.
> >>
> >> While on Intel platform, it doesn't have that additional level of
> >> HW control by design. Thus, the cleanup of encrypted MMIO page
> >> mapping in the S-EPT has to be coupled tightly with TDI context
> >> destruction in the TDI unbind process.
> >>
> >> If the TDI unbind is triggered in VFIO/IOMMUFD, there has be a
> >> cross-module notification to KVM to do cleanup in the S-EPT.
> >
> > QEMU should know about this unbind and can tell KVM about it too.
> > No cross module notification needed, it is not a hot path.
> >
>
> Yes. QEMU knows almost everything important, it can do the required
> flow and kernel can enforce the requirements. There shouldn't be
> problem at runtime.
>
> But if QEMU crashes, what are left there are only fd closing paths
> and objects that fds represent in the kernel. The modules those fds
> belongs need to solve the dependencies of tearing down objects
> without the help of QEMU.
>
> There will be private MMIO dmabuf fds, VFIO fds, IOMMU device fd, KVM
> fds at that time. Who should trigger the TDI unbind at this time?
>
> I think it should be triggered in the vdevice teardown path in IOMMUfd
> fd closing path, as it is where the bind is initiated.
>
> iommufd vdevice tear down (iommu fd closing path)
> ----> tsm_tdi_unbind
> ----> intel_tsm_tdi_unbind
> ...
> ----> private MMIO un-maping in KVM
> ----> cleanup private MMIO mapping in S-EPT and
> others ----> signal MMIO dmabuf can be safely removed.
> ^TVM teardown path (dmabuf uninstall path)
> checks this state and wait before it can decrease the
> dmabuf fd refcount
> ...
> ----> KVM TVM fd put
> ----> continue iommufd vdevice teardown.
>
> Also, I think we need:
>
> iommufd vdevice TSM bind
> ---> tsm_tdi_bind
> ----> intel_tsm_tdi_bind
> ...
> ----> KVM TVM fd get
Indent problem: I mean the KVM TVM fd get is in tsm_tdi_bind(). I see
your code already has it there.
> ...
>
> Z.
>
> >
> >> So shooting down the DMABUF object (encrypted MMIO page) means
> >> shooting down the S-EPT mapping and recovering the DMABUF object
> >> means re-construct the non-encrypted MMIO mapping in the EPT after
> >> the TDI is unbound.
> >
> > This is definitely QEMU's job to re-mmap MMIO to the userspace (as
> > it does for non-trusted devices today) so later on nested page
> > fault could fill the nested PTE. Thanks,
> >
> >
> >>
> >> Z.
> >>
> >>>>> What I really want is, one SW component to manage MMIO dmabuf,
> >>>>> secure iommu & TSM bind/unbind. So easier coordinate these 3
> >>>>> operations cause these ops are interconnected according to
> >>>>> secure firmware's requirement.
> >>>>
> >>>> This SW component is QEMU. It knows about FLRs and other config
> >>>> space things, it can destroy all these IOMMUFD objects and talk
> >>>> to VFIO too, I've tried, so far it is looking easier to manage.
> >>>> Thanks,
> >>>
> >>> Yes, qemu should be sequencing this. The kernel only needs to
> >>> enforce any rules required to keep the system from crashing.
> >>>
> >>> Jason
> >>>
> >>
> >
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 16:53 ` Zhi Wang
@ 2025-05-21 10:41 ` Alexey Kardashevskiy
0 siblings, 0 replies; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-21 10:41 UTC (permalink / raw)
To: Zhi Wang, Jason Gunthorpe
Cc: Xu Yilun, kvm@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
sumit.semwal@linaro.org, christian.koenig@amd.com,
pbonzini@redhat.com, seanjc@google.com,
alex.williamson@redhat.com, vivek.kasireddy@intel.com,
dan.j.williams@intel.com, yilun.xu@intel.com,
linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
lukas@wunner.de, yan.y.zhao@intel.com, daniel.vetter@ffwll.ch,
leon@kernel.org, baolu.lu@linux.intel.com,
zhenzhong.duan@intel.com, tao1.su@intel.com
On 16/5/25 02:53, Zhi Wang wrote:
> On Thu, 15 May 2025 16:44:47 +0000
> Zhi Wang <zhiw@nvidia.com> wrote:
>
>> On 15.5.2025 13.29, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 13/5/25 20:03, Zhi Wang wrote:
>>>> On Mon, 12 May 2025 11:06:17 -0300
>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>
>>>>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
>>>>> wrote:
>>>>>
>>>>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
>>>>>>>> it is just about managing the translation control of the
>>>>>>>> device.
>>>>>>>
>>>>>>> I have a little difficulty to understand. Is TSM bind PCI stuff?
>>>>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put
>>>>>>> the device in TDISP LOCKED state, so that device behaves
>>>>>>> differently from before. Then why put it in IOMMUFD?
>>>>>>
>>>>>>
>>>>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece
>>>>>> of IOMMU on the host CPU. The device does not know about the VM,
>>>>>> it just enables/disables encryption by a request from the CPU
>>>>>> (those start/stop interface commands). And IOMMUFD won't be
>>>>>> doing DOE, the platform driver (such as AMD CCP) will. Nothing
>>>>>> to do for VFIO here.
>>>>>>
>>>>>> We probably should notify VFIO about the state transition but I
>>>>>> do not know VFIO would want to do in response.
>>>>>
>>>>> We have an awkward fit for what CCA people are doing to the
>>>>> various Linux APIs. Looking somewhat maximally across all the
>>>>> arches a "bind" for a CC vPCI device creation operation does:
>>>>>
>>>>> - Setup the CPU page tables for the VM to have access to the
>>>>> MMIO
>>>>> - Revoke hypervisor access to the MMIO
>>>>> - Setup the vIOMMU to understand the vPCI device
>>>>> - Take over control of some of the IOVA translation, at least
>>>>> for T=1, and route to the the vIOMMU
>>>>> - Register the vPCI with any attestation functions the VM might
>>>>> use
>>>>> - Do some DOE stuff to manage/validate TDSIP/etc
>>>>>
>>>>> So we have interactions of things controlled by PCI, KVM, VFIO,
>>>>> and iommufd all mushed together.
>>>>>
>>>>> iommufd is the only area that already has a handle to all the
>>>>> required objects:
>>>>> - The physical PCI function
>>>>> - The CC vIOMMU object
>>>>> - The KVM FD
>>>>> - The CC vPCI object
>>>>>
>>>>> Which is why I have been thinking it is the right place to manage
>>>>> this.
>>>>>
>>>>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
>>>>> stays in VFIO.
>>>>>
>>>>>>>> So your issue is you need to shoot down the dmabuf during vPCI
>>>>>>>> device destruction?
>>>>>>>
>>>>>>> I assume "vPCI device" refers to assigned device in both shared
>>>>>>> mode & prvate mode. So no, I need to shoot down the dmabuf
>>>>>>> during TSM unbind, a.k.a. when assigned device is converting
>>>>>>> from private to shared. Then recover the dmabuf after TSM
>>>>>>> unbind. The device could still work in VM in shared mode.
>>>>>
>>>>> What are you trying to protect with this? Is there some intelism
>>>>> where you can't have references to encrypted MMIO pages?
>>>>>
>>>>
>>>> I think it is a matter of design choice. The encrypted MMIO page is
>>>> related to the TDI context and secure second level translation
>>>> table (S-EPT). and S-EPT is related to the confidential VM's
>>>> context.
>>>>
>>>> AMD and ARM have another level of HW control, together
>>>> with a TSM-owned meta table, can simply mask out the access to
>>>> those encrypted MMIO pages. Thus, the life cycle of the encrypted
>>>> mappings in the second level translation table can be de-coupled
>>>> from the TDI unbound. They can be reaped un-harmfully later by
>>>> hypervisor in another path.
>>>>
>>>> While on Intel platform, it doesn't have that additional level of
>>>> HW control by design. Thus, the cleanup of encrypted MMIO page
>>>> mapping in the S-EPT has to be coupled tightly with TDI context
>>>> destruction in the TDI unbind process.
>>>>
>>>> If the TDI unbind is triggered in VFIO/IOMMUFD, there has be a
>>>> cross-module notification to KVM to do cleanup in the S-EPT.
>>>
>>> QEMU should know about this unbind and can tell KVM about it too.
>>> No cross module notification needed, it is not a hot path.
>>>
>>
>> Yes. QEMU knows almost everything important, it can do the required
>> flow and kernel can enforce the requirements. There shouldn't be
>> problem at runtime.
>>
>> But if QEMU crashes, what are left there are only fd closing paths
>> and objects that fds represent in the kernel. The modules those fds
>> belongs need to solve the dependencies of tearing down objects
>> without the help of QEMU.
>>
>> There will be private MMIO dmabuf fds, VFIO fds, IOMMU device fd, KVM
>> fds at that time. Who should trigger the TDI unbind at this time?
>>
>> I think it should be triggered in the vdevice teardown path in IOMMUfd
>> fd closing path, as it is where the bind is initiated.
This is how I do it now, yes.
>>
>> iommufd vdevice tear down (iommu fd closing path)
>> ----> tsm_tdi_unbind
>> ----> intel_tsm_tdi_unbind
>> ...
>> ----> private MMIO un-maping in KVM
>> ----> cleanup private MMIO mapping in S-EPT and
>> others ----> signal MMIO dmabuf can be safely removed.
>> ^TVM teardown path (dmabuf uninstall path)
>> checks this state and wait before it can decrease the
>> dmabuf fd refcount
This extra signaling is not needed on AMD SEV though: 1) VFIO will destroy this dmabuf on teardown (and it won't care about its RMP state) and 2) the CCP driver will clear the RMP entries for the device's resources. The KVM mapping will die naturally when the KVM fd is closed.
>> ...
>> ----> KVM TVM fd put
>> ----> continue iommufd vdevice teardown.
>>
>> Also, I think we need:
>>
>> iommufd vdevice TSM bind
>> ---> tsm_tdi_bind
>> ----> intel_tsm_tdi_bind
>> ...
>> ----> KVM TVM fd get
>
> ident problem, I mean KVM TVM fd is in tsm_tdi_bind(). I saw your code
> has already had it there.
Yup, that's right.
>
>> ...
>>
>> Z.
>>
>>>
>>>> So shooting down the DMABUF object (encrypted MMIO page) means
>>>> shooting down the S-EPT mapping and recovering the DMABUF object
>>>> means re-construct the non-encrypted MMIO mapping in the EPT after
>>>> the TDI is unbound.
>>>
>>> This is definitely QEMU's job to re-mmap MMIO to the userspace (as
>>> it does for non-trusted devices today) so later on nested page
>>> fault could fill the nested PTE. Thanks,
>>>
>>>
>>>>
>>>> Z.
>>>>
>>>>>>> What I really want is, one SW component to manage MMIO dmabuf,
>>>>>>> secure iommu & TSM bind/unbind. So easier coordinate these 3
>>>>>>> operations cause these ops are interconnected according to
>>>>>>> secure firmware's requirement.
>>>>>>
>>>>>> This SW component is QEMU. It knows about FLRs and other config
>>>>>> space things, it can destroy all these IOMMUFD objects and talk
>>>>>> to VFIO too, I've tried, so far it is looking easier to manage.
>>>>>> Thanks,
>>>>>
>>>>> Yes, qemu should be sequencing this. The kernel only needs to
>>>>> enforce any rules required to keep the system from crashing.
>>>>>
>>>>> Jason
>>>>>
>>>>
>>>
>>
>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-12 14:06 ` Jason Gunthorpe
2025-05-13 10:03 ` Zhi Wang
@ 2025-05-14 7:02 ` Xu Yilun
2025-05-14 16:33 ` Jason Gunthorpe
1 sibling, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-14 7:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Mon, May 12, 2025 at 11:06:17AM -0300, Jason Gunthorpe wrote:
> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>
> > > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> > > > just about managing the translation control of the device.
> > >
> > > I have a little difficulty to understand. Is TSM bind PCI stuff? To me
> > > it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> > > TDISP LOCKED state, so that device behaves differently from before. Then
> > > why put it in IOMMUFD?
> >
> >
> > "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> > IOMMU on the host CPU. The device does not know about the VM, it
> > just enables/disables encryption by a request from the CPU (those
> > start/stop interface commands). And IOMMUFD won't be doing DOE, the
> > platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
> >
> > We probably should notify VFIO about the state transition but I do
> > not know VFIO would want to do in response.
>
> We have an awkward fit for what CCA people are doing to the various
> Linux APIs. Looking somewhat maximally across all the arches a "bind"
> for a CC vPCI device creation operation does:
>
> - Setup the CPU page tables for the VM to have access to the MMIO
This is a guest-side thing, isn't it? Is there anything the host needs to opt in to?
> - Revoke hypervisor access to the MMIO
VFIO could choose never to mmap the MMIO, so in that case there is nothing to do?
> - Setup the vIOMMU to understand the vPCI device
> - Take over control of some of the IOVA translation, at least for T=1,
> and route to the the vIOMMU
> - Register the vPCI with any attestation functions the VM might use
> - Do some DOE stuff to manage/validate TDSIP/etc
Intel TDX Connect has an extra requirement for "unbind":
- Revoke the KVM page table (S-EPT) for the MMIO only after TDISP
CONFIG_UNLOCK
Another thing is, it seems your term "bind" includes all steps of the
shared -> private conversion. But in my mind, "bind" only includes
putting the device in the TDISP LOCKED state & the corresponding host
setup required by the firmware. I.e. "bind" means the host locks down
the CC setup, waiting for guest attestation.
While "unbind" means breaking the CC setup, no matter whether the vPCI
device has already been accepted as a CC device, or is only locked and
waiting for attestation.
>
> So we have interactions of things controlled by PCI, KVM, VFIO, and
> iommufd all mushed together.
>
> iommufd is the only area that already has a handle to all the required
> objects:
> - The physical PCI function
> - The CC vIOMMU object
> - The KVM FD
> - The CC vPCI object
>
> Which is why I have been thinking it is the right place to manage
> this.
Yeah, I see the merit.
>
> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> stays in VFIO.
I'm not sure if Alexey's patch [1] illustrates your idea. It calls
tsm_tdi_bind(), which directly does device stuff and impacts MMIO.
VFIO doesn't know about this.
I have to interpret this as VFIO first handing over the device's CC
features and MMIO resources to IOMMUFD, so that VFIO never cares about
them.
[1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
>
> > > > So your issue is you need to shoot down the dmabuf during vPCI device
> > > > destruction?
> > >
> > > I assume "vPCI device" refers to assigned device in both shared mode &
> > > prvate mode. So no, I need to shoot down the dmabuf during TSM unbind,
> > > a.k.a. when assigned device is converting from private to shared.
> > > Then recover the dmabuf after TSM unbind. The device could still work
> > > in VM in shared mode.
>
> What are you trying to protect with this? Is there some intelism where
> you can't have references to encrypted MMIO pages?
>
> > > What I really want is, one SW component to manage MMIO dmabuf, secure
> > > iommu & TSM bind/unbind. So easier coordinate these 3 operations cause
> > > these ops are interconnected according to secure firmware's requirement.
> >
> > This SW component is QEMU. It knows about FLRs and other config
> > space things, it can destroy all these IOMMUFD objects and talk to
> > VFIO too, I've tried, so far it is looking easier to manage. Thanks,
>
> Yes, qemu should be sequencing this. The kernel only needs to enforce
> any rules required to keep the system from crashing.
To keep from crashing, the kernel still needs to enforce some
firmware-specific rules. That doesn't reduce the interactions between
kernel components. E.g. for TDX, if VFIO doesn't control "bind" but does
control MMIO, it should refuse FLR or MSE changes while the device is
bound. That means VFIO should at least be able to learn from IOMMUFD
whether the device is bound.
Furthermore, these rules are platform-firmware specific; "QEMU executes
kernel checks" means more SW components would have to be aware of these
rules. That multiplies the effort.
And QEMU can be killed, which means that if the kernel wants to reclaim
all the resources, it still has to deal with the sequencing. And I don't
think it is a good idea for the kernel to just leave a large amount of
resources stale.
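A hypothetical sketch of that check: iommufd_vdevice_tsm_bound() does
not exist and stands in for whatever kAPI IOMMUFD would expose, and the
->vdev field is an assumption:

  /* In VFIO's pre-reset path: refuse FLR/MSE while the TDI is bound. */
  static int vfio_pci_check_tsm_bound(struct vfio_pci_core_device *vdev)
  {
	if (iommufd_vdevice_tsm_bound(vdev->vdev))	/* hypothetical */
		return -EBUSY;
	return 0;
  }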
Thanks,
Yilun
>
> Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-14 7:02 ` Xu Yilun
@ 2025-05-14 16:33 ` Jason Gunthorpe
2025-05-15 16:04 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-14 16:33 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
> > We have an awkward fit for what CCA people are doing to the various
> > Linux APIs. Looking somewhat maximally across all the arches a "bind"
> > for a CC vPCI device creation operation does:
> >
> > - Setup the CPU page tables for the VM to have access to the MMIO
>
> This is guest side thing, is it? Anything host need to opt-in?
CPU hypervisor page tables.
> > - Revoke hypervisor access to the MMIO
>
> VFIO could choose never to mmap MMIO, so in this case nothing to do?
Yes, if you do it that way.
> > - Setup the vIOMMU to understand the vPCI device
> > - Take over control of some of the IOVA translation, at least for T=1,
> > and route to the the vIOMMU
> > - Register the vPCI with any attestation functions the VM might use
> > - Do some DOE stuff to manage/validate TDSIP/etc
>
> Intel TDX Connect has a extra requirement for "unbind":
>
> - Revoke KVM page table (S-EPT) for the MMIO only after TDISP
> CONFIG_UNLOCK
Maybe you could express this as the S-EPT always has the MMIO mapped
into it as long as the vPCI function is installed in the VM? Is KVM
responsible for the S-EPT?
> Another thing is, seems your term "bind" includes all steps for
> shared -> private conversion.
Well, I was talking about vPCI creation. I understand that during the
vPCI lifecycle the VM will do "bind"/"unbind", which are more or less
switching the device into a T=1 mode. Though I understood on some
arches this was mostly invisible to the hypervisor?
> But in my mind, "bind" only includes
> putting device in TDISP LOCK state & corresponding host setups required
> by firmware. I.e "bind" means host lockes down the CC setup, waiting for
> guest attestation.
So we will need to have some other API for this that modifies the vPCI
object.
It might be reasonable to have VFIO reach into iommufd to do that on
an already existing iommufd VDEVICE object. A little weird, but we
could probably make that work.
But you have some weird ordering issues here: if the S-EPT has to have
the VFIO MMIO mapped, then you have to have a close() destruction order
that sees VFIO remove the S-EPT and release the KVM, then have iommufd
destroy the VDEVICE object.
> > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > stays in VFIO.
>
> I'm not sure if Alexey's patch [1] illustates your idea. It calls
> tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
> VFIO doesn't know about this.
>
> I have to interpret this as VFIO firstly hand over device CC features
> and MMIO resources to IOMMUFD, so VFIO never cares about them.
>
> [1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
There is also the PCI layer involved here and maybe PCI should be
participating in managing some of this. Like it makes a bit of sense
that PCI would block the FLR on platforms that require this?
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-14 16:33 ` Jason Gunthorpe
@ 2025-05-15 16:04 ` Xu Yilun
2025-05-15 17:56 ` Jason Gunthorpe
2025-05-22 3:45 ` Alexey Kardashevskiy
0 siblings, 2 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-15 16:04 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Wed, May 14, 2025 at 01:33:39PM -0300, Jason Gunthorpe wrote:
> On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
> > > We have an awkward fit for what CCA people are doing to the various
> > > Linux APIs. Looking somewhat maximally across all the arches a "bind"
> > > for a CC vPCI device creation operation does:
> > >
> > > - Setup the CPU page tables for the VM to have access to the MMIO
> >
> > This is guest side thing, is it? Anything host need to opt-in?
>
> CPU hypervisor page tables.
>
> > > - Revoke hypervisor access to the MMIO
> >
> > VFIO could choose never to mmap MMIO, so in this case nothing to do?
>
> Yes, if you do it that way.
>
> > > - Setup the vIOMMU to understand the vPCI device
> > > - Take over control of some of the IOVA translation, at least for T=1,
> > > and route to the the vIOMMU
> > > - Register the vPCI with any attestation functions the VM might use
> > > - Do some DOE stuff to manage/validate TDSIP/etc
> >
> > Intel TDX Connect has a extra requirement for "unbind":
> >
> > - Revoke KVM page table (S-EPT) for the MMIO only after TDISP
> > CONFIG_UNLOCK
>
> Maybe you could express this as the S-EPT always has the MMIO mapped
> into it as long as the vPCI function is installed to the VM?
Yeah.
> Is KVM responsible for the S-EPT?
Yes.
>
> > Another thing is, seems your term "bind" includes all steps for
> > shared -> private conversion.
>
> Well, I was talking about vPCI creation. I understand that during the
> vPCI lifecycle the VM will do "bind" "unbind" which are more or less
> switching the device into a T=1 mode. Though I understood on some
I want to introduce some terms about CC vPCI.
1. "Bind": the guest requests the host to do the host-side CC setup &
put the device in the CONFIG_LOCKED state, waiting for attestation. Any
further change which has a security concern breaks "bind", e.g. reset,
touching MMIO, physical MSE, BAR addr...
2. "Attest": after "bind", the guest verifies the device evidence (cert,
measurement...).
3. "Accept": after successful attestation, the guest does the guest-side
CC setup & switches the device into T=1 mode (TDISP RUN state).
4. "Unbind": the guest requests the host to put the device in the
CONFIG_UNLOCKED state + remove all CC setup.
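For illustration, the TDISP interface states these verbs move between
(state names follow the PCIe TDISP spec; the enum itself is just a
sketch, not kernel code):

  enum tdisp_state {
	TDISP_CONFIG_UNLOCKED,	/* initial; also after "Unbind" or reset */
	TDISP_CONFIG_LOCKED,	/* after "Bind": config locked, awaiting
				 * attestation */
	TDISP_RUN,		/* after "Accept": T=1 operation */
	TDISP_ERROR,		/* any security-relevant change while
				 * locked/running lands here */
  };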
> arches this was mostly invisible to the hypervisor?
Attest & Accept can be invisible to the hypervisor, or the host just
helps pass data blobs between the guest, firmware & device.
Bind cannot be host-agnostic; the host should be aware not to touch the
device after Bind.
>
> > But in my mind, "bind" only includes
> > putting device in TDISP LOCK state & corresponding host setups required
> > by firmware. I.e "bind" means host lockes down the CC setup, waiting for
> > guest attestation.
>
> So we will need to have some other API for this that modifies the vPCI
> object.
IIUC, in Alexey's patch ioctl(iommufd, IOMMU_VDEVICE_TSM_BIND) does the
"Bind" thing on the host.
>
> It might be reasonable to have VFIO reach into iommufd to do that on
> an already existing iommufd VDEVICE object. A little weird, but we
> could probably make that work.
Mm, are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
  -> iommufd_device_attach_vdev()
    -> tsm_tdi_bind()
>
> But you have some weird ordering issues here if the S-EPT has to have
> the VFIO MMIO then you have to have a close() destruction order that
Yeah, by holding a KVM reference.
> sees VFIO remove the S-EPT and release the KVM, then have iommufd
> destroy the VDEVICE object.
Regarding VM destruction, TDX Connect has an additional enforcement: the
VM can only be destroyed after all assigned CC vPCI devices are
destroyed.
Nowadays, VFIO already holds a KVM reference, so we need
close(vfio_fd)
  -> iommufd_device_detach_vdev()
    -> tsm_tdi_unbind()
      -> tdi stop
      -> callback to VFIO, dmabuf_move_notify(revoke)
        -> KVM unmap MMIO
      -> tdi metadata remove
  -> kvm_put_kvm()
    -> kvm_destroy_vm()
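A C sketch of that chain (tsm_tdi_unbind() is from Alexey's RFC;
dma_buf_move_notify() and kvm_put_kvm() are existing kernel APIs; the
vdev fields and the locking shown are simplified assumptions):

  static void iommufd_device_detach_vdev(struct iommufd_vdevice *vdev)
  {
	tsm_tdi_unbind(vdev->tdi);	/* TDI stop, CONFIG_UNLOCKED */

	/* revoke: importers (KVM) unmap the MMIO on move notification */
	dma_resv_lock(vdev->mmio_dmabuf->resv, NULL);
	dma_buf_move_notify(vdev->mmio_dmabuf);
	dma_resv_unlock(vdev->mmio_dmabuf->resv);

	/* ... tdi metadata remove ... */
	kvm_put_kvm(vdev->kvm);	/* final put runs kvm_destroy_vm() */
  }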
>
> > > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > > stays in VFIO.
> >
> > I'm not sure if Alexey's patch [1] illustates your idea. It calls
> > tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
> > VFIO doesn't know about this.
> >
> > I have to interpret this as VFIO firstly hand over device CC features
> > and MMIO resources to IOMMUFD, so VFIO never cares about them.
> >
> > [1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
>
> There is also the PCI layer involved here and maybe PCI should be
> participating in managing some of this. Like it makes a bit of sense
> that PCI would block the FLR on platforms that require this?
An FLR to a bound device is absolutely fine, it just breaks the CC
state. Sometimes it is exactly what the host needs to stop CC
immediately.
The problem is in VFIO's pre-FLR handling, so we need to patch VFIO, not
the PCI core.
Thanks,
Yilun
>
> Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 16:04 ` Xu Yilun
@ 2025-05-15 17:56 ` Jason Gunthorpe
2025-05-16 6:03 ` Xu Yilun
2025-05-22 3:45 ` Alexey Kardashevskiy
1 sibling, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-15 17:56 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, May 16, 2025 at 12:04:04AM +0800, Xu Yilun wrote:
> > arches this was mostly invisible to the hypervisor?
>
> Attest & Accept can be invisible to hypervisor, or host just help pass
> data blobs between guest, firmware & device.
>
> Bind cannot be host agnostic, host should be aware not to touch device
> after Bind.
I'm not sure this is fully true, this could be an Intel thing. When the
vPCI is created the host can already know it shouldn't touch the PCI
device anymore and the secure world would enforce that when it gets a
bind command.
The fact it hasn't been locked out immediately at vPCI creation time
is sort of a detail that doesn't matter, IMHO.
> > It might be reasonable to have VFIO reach into iommufd to do that on
> > an already existing iommufd VDEVICE object. A little weird, but we
> > could probably make that work.
>
> Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
>
> ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
> -> iommufd_device_attach_vdev()
> -> tsm_tdi_bind()
Not ATTACH, you wanted BIND. You could have a VFIO_DEVICE_BIND(iommufd
vdevice id)
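A hypothetical shape for such a uAPI (neither this ioctl nor the struct
exists today; the request number is a placeholder):

  struct vfio_device_bind_tsm {
	__u32	argsz;
	__u32	flags;
	__s32	iommufd;	/* iommufd owning the vdevice */
	__u32	vdevice_id;	/* existing IOMMUFD vdevice object */
  };
  #define VFIO_DEVICE_BIND_TSM	_IO(VFIO_TYPE, VFIO_BASE + 21)

VFIO would resolve the vdevice via the iommufd handle and call into
iommufd/TSM to perform the bind, keeping the TDISP sequencing in one
place.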
> > sees VFIO remove the S-EPT and release the KVM, then have iommufd
> > destroy the VDEVICE object.
>
> Regarding VM destroy, TDX Connect has more enforcement, VM could only be
> destroyed after all assigned CC vPCI devices are destroyed.
And KVM destroys the VM?
> Nowadays, VFIO already holds KVM reference, so we need
>
> close(vfio_fd)
> -> iommufd_device_detach_vdev()
This doesn't happen though; it destroys the normal device (idev) which
the vdevice is stacked on top of. You'd have to make normal device
destruction trigger vdevice destruction.
> -> tsm_tdi_unbind()
> -> tdi stop
> -> callback to VFIO, dmabuf_move_notify(revoke)
> -> KVM unmap MMIO
> -> tdi metadata remove
This omits the viommu. It won't get destroyed until the iommufd
closes, so iommufd will be holding the kvm and it will do the final
put.
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 17:56 ` Jason Gunthorpe
@ 2025-05-16 6:03 ` Xu Yilun
0 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-16 6:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Thu, May 15, 2025 at 02:56:58PM -0300, Jason Gunthorpe wrote:
> On Fri, May 16, 2025 at 12:04:04AM +0800, Xu Yilun wrote:
> > > arches this was mostly invisible to the hypervisor?
> >
> > Attest & Accept can be invisible to hypervisor, or host just help pass
> > data blobs between guest, firmware & device.
> >
> > Bind cannot be host agnostic, host should be aware not to touch device
> > after Bind.
>
> I'm not sure this is fully true, this could be a Intel thing. When the
> vPCI is created the host can already know it shouldn't touch the PCI
> device anymore and the secure world would enforce that when it gets a
> bind command.
>
> The fact it hasn't been locked out immediately at vPCI creation time
> is sort of a detail that doesn't matter, IMHO.
I see, SW can define the lock-out over a wider range. I suddenly
understand you are considering finishing all host-side CC setup at
viommu_alloc & vdevice_alloc, before KVM runs; then "Bind" could be
host-agnostic, and TDISP LOCK/STOP could also be a guest_request.
Now the problem is that for TDX, the host cannot be agnostic to
LOCK/STOP because of the KVM MMIO mapping ...
I still have to make VFIO uAPIs for "Bind"/"Unbind".
>
> > > It might be reasonable to have VFIO reach into iommufd to do that on
> > > an already existing iommufd VDEVICE object. A little weird, but we
> > > could probably make that work.
> >
> > Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
> >
> > ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
> > -> iommufd_device_attach_vdev()
> > -> tsm_tdi_bind()
>
> Not ATTACH, you wanted BIND. You could have a VFIO_DEVICE_BIND(iommufd
> vdevice id)
Yes.
>
> > > sees VFIO remove the S-EPT and release the KVM, then have iommufd
> > > destroy the VDEVICE object.
> >
> > Regarding VM destroy, TDX Connect has more enforcement, VM could only be
> > destroyed after all assigned CC vPCI devices are destroyed.
>
> And KVM destroys the VM?
Yes.
>
> > Nowadays, VFIO already holds KVM reference, so we need
> >
> > close(vfio_fd)
> > -> iommufd_device_detach_vdev()
>
> This doesn't happen though, it destroys the normal device (idev) which
> the vdevice is stacked on top of. You'd have to make normal device
> destruction trigger vdevice destruction
>
> > -> tsm_tdi_unbind()
> > -> tdi stop
> > -> callback to VFIO, dmabuf_move_notify(revoke)
> > -> KVM unmap MMIO
> > -> tdi metadata remove
>
> This omits the viommu. It won't get destroyed until the iommufd
> closes, so iommufd will be holding the kvm and it will do the final
> put.
I see.
https://lore.kernel.org/all/20250319233111.GE126678@ziepe.ca/
Thanks,
Yilun
>
> Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-15 16:04 ` Xu Yilun
2025-05-15 17:56 ` Jason Gunthorpe
@ 2025-05-22 3:45 ` Alexey Kardashevskiy
2025-05-24 3:13 ` Xu Yilun
1 sibling, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-22 3:45 UTC (permalink / raw)
To: Xu Yilun, Jason Gunthorpe
Cc: kvm, dri-devel, linux-media, linaro-mm-sig, sumit.semwal,
christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 16/5/25 02:04, Xu Yilun wrote:
> On Wed, May 14, 2025 at 01:33:39PM -0300, Jason Gunthorpe wrote:
>> On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
>>>> We have an awkward fit for what CCA people are doing to the various
>>>> Linux APIs. Looking somewhat maximally across all the arches a "bind"
>>>> for a CC vPCI device creation operation does:
>>>>
>>>> - Setup the CPU page tables for the VM to have access to the MMIO
>>>
>>> This is guest side thing, is it? Anything host need to opt-in?
>>
>> CPU hypervisor page tables.
>>
>>>> - Revoke hypervisor access to the MMIO
>>>
>>> VFIO could choose never to mmap MMIO, so in this case nothing to do?
>>
>> Yes, if you do it that way.
>>
>>>> - Setup the vIOMMU to understand the vPCI device
>>>> - Take over control of some of the IOVA translation, at least for T=1,
>>>> and route to the the vIOMMU
>>>> - Register the vPCI with any attestation functions the VM might use
>>>> - Do some DOE stuff to manage/validate TDSIP/etc
>>>
>>> Intel TDX Connect has a extra requirement for "unbind":
>>>
>>> - Revoke KVM page table (S-EPT) for the MMIO only after TDISP
>>> CONFIG_UNLOCK
>>
>> Maybe you could express this as the S-EPT always has the MMIO mapped
>> into it as long as the vPCI function is installed to the VM?
>
> Yeah.
>
>> Is KVM responsible for the S-EPT?
>
> Yes.
>
>>
>>> Another thing is, seems your term "bind" includes all steps for
>>> shared -> private conversion.
>>
>> Well, I was talking about vPCI creation. I understand that during the
>> vPCI lifecycle the VM will do "bind" "unbind" which are more or less
>> switching the device into a T=1 mode. Though I understood on some
>
> I want to introduce some terms about CC vPCI.
>
> 1. "Bind", guest requests host do host side CC setup & put device in
> CONFIG_LOCKED state, waiting for attestation. Any further change which
> has secuity concern breaks "bind", e.g. reset, touch MMIO, physical MSE,
> BAR addr...
>
> 2. "Attest", after "bind", guest verifies device evidences (cert,
> measurement...).
>
> 3. "Accept", after successful attestation, guest do guest side CC setup &
> switch the device into T=1 mode (TDISP RUN state)
(implementation note)
AMD SEV moves the TDI to RUN at "Attest", as the guest can still avoid encrypted MMIO access and the PSP keeps the IOMMU blocked until the guest enables it.
> 4. "Unbind", guest requests host put device in CONFIG_UNLOCK state +
> remove all CC setup.
>
>> arches this was mostly invisible to the hypervisor?
>
> Attest & Accept can be invisible to hypervisor, or host just help pass
> data blobs between guest, firmware & device.
No, they cannot.
> Bind cannot be host agnostic, host should be aware not to touch device
> after Bind.
Bind actually connects a TDI to a guest; the guest could not possibly do that alone, as it does not know/have access to the physical PCI function #0 to do the DOE/SecSPDM messaging, and neither does the PSP.
The non-touching clause (or, more precisely, "selectively touching") is about "Attest" and "Accept", when the TDI is in the CONFIG_LOCKED or RUN state. To the point that we would rather block config space and MSI-X BAR access after the TDI is CONFIG_LOCKED/RUN, to prevent the TDI from going to the ERROR state.
>>
>>> But in my mind, "bind" only includes
>>> putting device in TDISP LOCK state & corresponding host setups required
>>> by firmware. I.e "bind" means host lockes down the CC setup, waiting for
>>> guest attestation.
>>
>> So we will need to have some other API for this that modifies the vPCI
>> object.
>
> IIUC, in Alexey's patch ioctl(iommufd, IOMMU_VDEVICE_TSM_BIND) does the
> "Bind" thing in host.
I am still not sure what "vPCI" means exactly: a passed-through PCI device? Or a piece of vIOMMU handling such a device?
>> It might be reasonable to have VFIO reach into iommufd to do that on
>> an already existing iommufd VDEVICE object. A little weird, but we
>> could probably make that work.
>
> Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
>
> ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
> -> iommufd_device_attach_vdev()
> -> tsm_tdi_bind()
>
>>
>> But you have some weird ordering issues here if the S-EPT has to have
>> the VFIO MMIO then you have to have a close() destruction order that
>
> Yeah, by holding kvm reference.
>
>> sees VFIO remove the S-EPT and release the KVM, then have iommufd
>> destroy the VDEVICE object.
>
> Regarding VM destroy, TDX Connect has more enforcement, VM could only be
> destroyed after all assigned CC vPCI devices are destroyed.
Can be done by making IOMMUFD/vdevice hold the kvm pointer to ensure tsm_tdi_unbind() is not called before the guest has disappeared from the firmware. I seem to be just lucky with the current order in which things are destroyed, hmm.
> Nowadays, VFIO already holds KVM reference, so we need
>
> close(vfio_fd)
> -> iommufd_device_detach_vdev()
> -> tsm_tdi_unbind()
> -> tdi stop
> -> callback to VFIO, dmabuf_move_notify(revoke)
> -> KVM unmap MMIO
> -> tdi metadata remove
> -> kvm_put_kvm()
> -> kvm_destroy_vm()
>
>
>>
>>>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
>>>> stays in VFIO.
>>>
>>> I'm not sure if Alexey's patch [1] illustates your idea. It calls
>>> tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
>>> VFIO doesn't know about this.
VFIO knows enough about this, as we asked it to share the MMIO via a dmabuf fd and not via mmap(); otherwise it is the same MMIO, exactly where it was, the BARs do not change.
>>>
>>> I have to interpret this as VFIO firstly hand over device CC features
>>> and MMIO resources to IOMMUFD, so VFIO never cares about them.
>>>
>>> [1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
>>
>> There is also the PCI layer involved here and maybe PCI should be
>> participating in managing some of this. Like it makes a bit of sense
>> that PCI would block the FLR on platforms that require this?
>
> FLR to a bound device is absolutely fine, just break the CC state.
> Sometimes it is exactly what host need to stop CC immediately.
> The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
> PCI core.
What is the problem here exactly?
An FLR by the host, which is equivalent to any other PCI error? The guest may or may not be able to handle it; afaik it does not handle any errors now, QEMU just stops the guest.
Or an FLR by the guest? Then it knows it needs to do the attest/accept dance again.
Thanks,
>
> Thanks,
> Yilun
>
>>
>> Jason
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-22 3:45 ` Alexey Kardashevskiy
@ 2025-05-24 3:13 ` Xu Yilun
2025-05-26 7:18 ` Alexey Kardashevskiy
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-24 3:13 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Thu, May 22, 2025 at 01:45:57PM +1000, Alexey Kardashevskiy wrote:
>
>
> On 16/5/25 02:04, Xu Yilun wrote:
> > On Wed, May 14, 2025 at 01:33:39PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
> > > > > We have an awkward fit for what CCA people are doing to the various
> > > > > Linux APIs. Looking somewhat maximally across all the arches a "bind"
> > > > > for a CC vPCI device creation operation does:
> > > > >
> > > > > - Setup the CPU page tables for the VM to have access to the MMIO
> > > >
> > > > This is guest side thing, is it? Anything host need to opt-in?
> > >
> > > CPU hypervisor page tables.
> > >
> > > > > - Revoke hypervisor access to the MMIO
> > > >
> > > > VFIO could choose never to mmap MMIO, so in this case nothing to do?
> > >
> > > Yes, if you do it that way.
> > > > > - Setup the vIOMMU to understand the vPCI device
> > > > > - Take over control of some of the IOVA translation, at least for T=1,
> > > > > and route to the the vIOMMU
> > > > > - Register the vPCI with any attestation functions the VM might use
> > > > > - Do some DOE stuff to manage/validate TDSIP/etc
> > > >
> > > > Intel TDX Connect has a extra requirement for "unbind":
> > > >
> > > > - Revoke KVM page table (S-EPT) for the MMIO only after TDISP
> > > > CONFIG_UNLOCK
> > >
> > > Maybe you could express this as the S-EPT always has the MMIO mapped
> > > into it as long as the vPCI function is installed to the VM?
> >
> > Yeah.
> >
> > > Is KVM responsible for the S-EPT?
> >
> > Yes.
> >
> > >
> > > > Another thing is, seems your term "bind" includes all steps for
> > > > shared -> private conversion.
> > >
> > > Well, I was talking about vPCI creation. I understand that during the
> > > vPCI lifecycle the VM will do "bind" "unbind" which are more or less
> > > switching the device into a T=1 mode. Though I understood on some
> >
> > I want to introduce some terms about CC vPCI.
> >
> > 1. "Bind", guest requests host do host side CC setup & put device in
> > CONFIG_LOCKED state, waiting for attestation. Any further change which
> > has secuity concern breaks "bind", e.g. reset, touch MMIO, physical MSE,
> > BAR addr...
> >
> > 2. "Attest", after "bind", guest verifies device evidences (cert,
> > measurement...).
> >
> > 3. "Accept", after successful attestation, guest do guest side CC setup &
> > switch the device into T=1 mode (TDISP RUN state)
>
> (implementation note)
> AMD SEV moves TDI to RUN at "Attest" as a guest still can avoid encrypted MMIO access and the PSP keeps IOMMU blocked until the guest enables it.
>
Good to know. That's why we have these SW-defined verbs rather than
reusing the TDISP terms.
> > 4. "Unbind", guest requests host put device in CONFIG_UNLOCK state +
> > remove all CC setup.
> >
> > > arches this was mostly invisible to the hypervisor?
> >
> > Attest & Accept can be invisible to hypervisor, or host just help pass
> > data blobs between guest, firmware & device.
>
> No, they cannot.
Mm.. the TSM driver is the agent of the trusted firmware in the OS, so I
excluded it from "hypervisor". The TSM driver could parse the data blobs
and do whatever is requested by the trusted firmware.
I want to justify the general guest_request interface, i.e. explain why
VFIO/IOMMUFD don't have to maintain the "attest" and "accept" states.
>
> > Bind cannot be host agnostic, host should be aware not to touch device
> > after Bind.
>
> Bind actually connects a TDI to a guest, the guest could not possibly do that alone as it does not know/have access to the physical PCI function#0 to do the DOE/SecSPDM messaging, and neither does the PSP.
>
> The non-touching clause (or, more precisely "selectively touching") is about "Attest" and "Accept" when the TDI is in the CONFIG_LOCKED or RUN state. Up to the point when we rather want to block the config space and MSIX BAR access after the TDI is CONFIG_LOCKED/RUN to prevent TDI from going to the ERROR state.
>
>
> > >
> > > > But in my mind, "bind" only includes
> > > > putting device in TDISP LOCK state & corresponding host setups required
> > > > by firmware. I.e "bind" means host lockes down the CC setup, waiting for
> > > > guest attestation.
> > >
> > > So we will need to have some other API for this that modifies the vPCI
> > > object.
> >
> > IIUC, in Alexey's patch ioctl(iommufd, IOMMU_VDEVICE_TSM_BIND) does the
> > "Bind" thing in host.
>
>
> I am still not sure what "vPCI" means exactly, a passed through PCI device? Or a piece of vIOMMU handling such device?
>
My understanding is both. When you "Bind" you modify the physical
device, and you may also need to set up a piece of vIOMMU for private
assignment to work.
>
> > > It might be reasonable to have VFIO reach into iommufd to do that on
> > > an already existing iommufd VDEVICE object. A little weird, but we
> > > could probably make that work.
> >
> > Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
> >
> > ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
> > -> iommufd_device_attach_vdev()
> > -> tsm_tdi_bind()
> >
> > >
> > > But you have some weird ordering issues here if the S-EPT has to have
> > > the VFIO MMIO then you have to have a close() destruction order that
> >
> > Yeah, by holding kvm reference.
> >
> > > sees VFIO remove the S-EPT and release the KVM, then have iommufd
> > > destroy the VDEVICE object.
> >
> > Regarding VM destroy, TDX Connect has more enforcement, VM could only be
> > destroyed after all assigned CC vPCI devices are destroyed.
>
> Can be done by making IOMMUFD/vdevice holding the kvm pointer to ensure tsm_tdi_unbind() is not called before the guest disappeared from the firmware. I seem to be just lucky with the current order of things being destroyed, hmm.
>
tsm_tdi_unbind() *should* be called before the guest disappears. For TDX
Connect that is the enforcement. Holding the KVM pointer is the
effective way.
>
> > Nowadays, VFIO already holds KVM reference, so we need
> >
> > close(vfio_fd)
> > -> iommufd_device_detach_vdev()
> > -> tsm_tdi_unbind()
> > -> tdi stop
> > -> callback to VFIO, dmabuf_move_notify(revoke)
> > -> KVM unmap MMIO
> > -> tdi metadata remove
> > -> kvm_put_kvm()
> > -> kvm_destroy_vm()
> >
> >
> > >
> > > > > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > > > > stays in VFIO.
> > > >
> > > > I'm not sure if Alexey's patch [1] illustates your idea. It calls
> > > > tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
> > > > VFIO doesn't know about this.
>
> VFIO knows about this enough as we asked it to share MMIO via dmabuf's fd and not via mmap(), otherwise it is the same MMIO, exactly where it was, BARs do not change.
>
Yes, if you define a SW "lock down" in a broader sense than TDISP
LOCKED. But it seems TDX Connect cannot adopt this solution because it
still needs to handle MMIO invalidation before FLR, see below.
> > > >
> > > > I have to interpret this as VFIO firstly hand over device CC features
> > > > and MMIO resources to IOMMUFD, so VFIO never cares about them.
> > > >
> > > > [1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
> > >
> > > There is also the PCI layer involved here and maybe PCI should be
> > > participating in managing some of this. Like it makes a bit of sense
> > > that PCI would block the FLR on platforms that require this?
> >
> > FLR to a bound device is absolutely fine, just break the CC state.
> > Sometimes it is exactly what host need to stop CC immediately.
> > The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
> > PCI core.
>
> What is the problem here exactly?
> FLR by the host, which equals any other PCI error? The guest may or may not be able to handle it; afaik it does not handle any errors now, QEMU just stops the guest.
It is about TDX Connect.
According to the dmabuf patchset, the dmabuf needs to be revoked before
FLR. That means KVM unmaps MMIOs while the device is in the LOCKED/RUN
state. That is forbidden by the TDX Module and will crash KVM. So the
safer way is to unbind the TDI first, then revoke MMIOs, then do FLR.
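Sketching that order in a VFIO reset path (hypothetical code: the
tdi/tdi_bound fields are made up, tsm_tdi_unbind() follows the TSM RFC,
and the revoke helper loosely follows the VFIO dmabuf patchset):

  /* Pre-FLR teardown in the order that keeps the TDX Module happy. */
  static void vfio_pci_cc_pre_flr(struct vfio_pci_core_device *vdev)
  {
          /* 1) Unbind: the device leaves LOCKED/RUN, so S-EPT unmap is legal. */
          if (vdev->tdi_bound)
                  tsm_tdi_unbind(vdev->tdi);

          /* 2) Revoke MMIO: importers such as KVM unmap via move_notify. */
          vfio_pci_dma_buf_move(vdev, true);

          /* 3) Only now may the caller issue the FLR itself. */
  }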
I'm not sure whether AMD will have the same issue when p2p DMA is
involved. Because in that case, MMIOs would also be mapped in the IOMMU
PT, and revoking MMIOs means dropping the IOMMU mapping. The root cause
of the concern is that secure firmware should monitor IOMMU mapping
integrity for private assignment, or the hypervisor could silently drop
trusted DMA writes.
TDX Connect has a wider impact on this issue because it uses the same
table for the KVM S-EPT and the Secure IOMMU PT.
Thanks,
Yilun
> Or FLR by the guest? Then it knows it needs to do the dance with attest/accept, again.
>
> Thanks,
>
> >
> > Thanks,
> > Yilun
> >
> > >
> > > Jason
>
> --
> Alexey
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-24 3:13 ` Xu Yilun
@ 2025-05-26 7:18 ` Alexey Kardashevskiy
2025-05-29 14:41 ` Xu Yilun
0 siblings, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-26 7:18 UTC (permalink / raw)
To: Xu Yilun
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 24/5/25 13:13, Xu Yilun wrote:
> On Thu, May 22, 2025 at 01:45:57PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 16/5/25 02:04, Xu Yilun wrote:
>>> On Wed, May 14, 2025 at 01:33:39PM -0300, Jason Gunthorpe wrote:
>>>> On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
>>>>>> We have an awkward fit for what CCA people are doing to the various
>>>>>> Linux APIs. Looking somewhat maximally across all the arches a "bind"
>>>>>> for a CC vPCI device creation operation does:
>>>>>>
>>>>>> - Setup the CPU page tables for the VM to have access to the MMIO
>>>>>
>>>>> This is guest side thing, is it? Anything host need to opt-in?
>>>>
>>>> CPU hypervisor page tables.
>>>>
>>>>>> - Revoke hypervisor access to the MMIO
>>>>>
>>>>> VFIO could choose never to mmap MMIO, so in this case nothing to do?
>>>>
>>>> Yes, if you do it that way.
>>>>>> - Setup the vIOMMU to understand the vPCI device
>>>>>> - Take over control of some of the IOVA translation, at least for T=1,
>>>>>> and route to the the vIOMMU
>>>>>> - Register the vPCI with any attestation functions the VM might use
>>>>>> - Do some DOE stuff to manage/validate TDSIP/etc
>>>>>
>>>>> Intel TDX Connect has a extra requirement for "unbind":
>>>>>
>>>>> - Revoke KVM page table (S-EPT) for the MMIO only after TDISP
>>>>> CONFIG_UNLOCK
>>>>
>>>> Maybe you could express this as the S-EPT always has the MMIO mapped
>>>> into it as long as the vPCI function is installed to the VM?
>>>
>>> Yeah.
>>>
>>>> Is KVM responsible for the S-EPT?
>>>
>>> Yes.
>>>
>>>>
>>>>> Another thing is, seems your term "bind" includes all steps for
>>>>> shared -> private conversion.
>>>>
>>>> Well, I was talking about vPCI creation. I understand that during the
>>>> vPCI lifecycle the VM will do "bind" "unbind" which are more or less
>>>> switching the device into a T=1 mode. Though I understood on some
>>>
>>> I want to introduce some terms about CC vPCI.
>>>
>>> 1. "Bind", guest requests host do host side CC setup & put device in
>>> CONFIG_LOCKED state, waiting for attestation. Any further change which
>>> has secuity concern breaks "bind", e.g. reset, touch MMIO, physical MSE,
>>> BAR addr...
>>>
>>> 2. "Attest", after "bind", guest verifies device evidences (cert,
>>> measurement...).
>>>
>>> 3. "Accept", after successful attestation, guest do guest side CC setup &
>>> switch the device into T=1 mode (TDISP RUN state)
>>
>> (implementation note)
>> AMD SEV moves TDI to RUN at "Attest" as a guest still can avoid encrypted MMIO access and the PSP keeps IOMMU blocked until the guest enables it.
>>
>
> Good to know. That's why we have these SW defined verbs rather than
> reusing TDISP terms.
>
>>> 4. "Unbind", guest requests host put device in CONFIG_UNLOCK state +
>>> remove all CC setup.
>>>
>>>> arches this was mostly invisible to the hypervisor?
>>>
>>> Attest & Accept can be invisible to hypervisor, or host just help pass
>>> data blobs between guest, firmware & device.
>>
>> No, they cannot.
>
> MM.. the TSM driver is the agent of the trusted firmware in the OS, so I
> excluded it from "hypervisor". The TSM driver could parse data blobs and do
> whatever is requested by the trusted firmware.
>
> I want to justify the general guest_request interface, and explain why
> VFIO/IOMMUFD don't have to maintain the "attest", "accept" states.
>
>>
>>> Bind cannot be host agnostic, host should be aware not to touch device
>>> after Bind.
>>
>> Bind actually connects a TDI to a guest, the guest could not possibly do that alone as it does not know/have access to the physical PCI function#0 to do the DOE/SecSPDM messaging, and neither does the PSP.
>>
>> The non-touching clause (or, more precisely "selectively touching") is about "Attest" and "Accept" when the TDI is in the CONFIG_LOCKED or RUN state. Up to the point when we rather want to block the config space and MSIX BAR access after the TDI is CONFIG_LOCKED/RUN to prevent TDI from going to the ERROR state.
>>
>>
>>>>
>>>>> But in my mind, "bind" only includes
>>>>> putting device in TDISP LOCK state & corresponding host setups required
>>>>> by firmware. I.e "bind" means host lockes down the CC setup, waiting for
>>>>> guest attestation.
>>>>
>>>> So we will need to have some other API for this that modifies the vPCI
>>>> object.
>>>
>>> IIUC, in Alexey's patch ioctl(iommufd, IOMMU_VDEVICE_TSM_BIND) does the
>>> "Bind" thing in host.
>>
>>
>> I am still not sure what "vPCI" means exactly, a passed through PCI device? Or a piece of vIOMMU handling such device?
>>
>
> My understanding is both. When you "Bind" you modify the physical
> device, and you may also need to set up a piece of vIOMMU for private
> assignment to work.
>
>>
>>>> It might be reasonable to have VFIO reach into iommufd to do that on
>>>> an already existing iommufd VDEVICE object. A little weird, but we
>>>> could probably make that work.
>>>
>>> Mm, Are you proposing an uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:
>>>
>>> ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
>>> -> iommufd_device_attach_vdev()
>>> -> tsm_tdi_bind()
>>>
>>>>
>>>> But you have some weird ordering issues here if the S-EPT has to have
>>>> the VFIO MMIO then you have to have a close() destruction order that
>>>
>>> Yeah, by holding kvm reference.
>>>
>>>> sees VFIO remove the S-EPT and release the KVM, then have iommufd
>>>> destroy the VDEVICE object.
>>>
>>> Regarding VM destroy, TDX Connect has more enforcement, VM could only be
>>> destroyed after all assigned CC vPCI devices are destroyed.
>>
>> Can be done by making IOMMUFD/vdevice holding the kvm pointer to ensure tsm_tdi_unbind() is not called before the guest disappeared from the firmware. I seem to be just lucky with the current order of things being destroyed, hmm.
>>
>
> tsm_tdi_unbind() *should* be called before the guest disappears. For TDX
> Connect that is the enforcement. Holding the KVM pointer is the effective
> way to guarantee it.
>
>>
>>> Nowadays, VFIO already holds KVM reference, so we need
>>>
>>> close(vfio_fd)
>>> -> iommufd_device_detach_vdev()
>>> -> tsm_tdi_unbind()
>>> -> tdi stop
>>> -> callback to VFIO, dmabuf_move_notify(revoke)
>>> -> KVM unmap MMIO
>>> -> tdi metadata remove
>>> -> kvm_put_kvm()
>>> -> kvm_destroy_vm()
>>>
>>>
>>>>
>>>>>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
>>>>>> stays in VFIO.
>>>>>
>>>>> I'm not sure if Alexey's patch [1] illustates your idea. It calls
>>>>> tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
>>>>> VFIO doesn't know about this.
>>
>> VFIO knows about this enough as we asked it to share MMIO via dmabuf's fd and not via mmap(), otherwise it is the same MMIO, exactly where it was, BARs do not change.
>>
>
> Yes, if you define a SW "lock down" in a broader sense than TDISP LOCKED.
> But it seems TDX Connect cannot adopt this solution because it still
> needs to handle MMIO invalidation before FLR, see below.
>
>>>>>
>>>>> I have to interpret this as VFIO firstly hand over device CC features
>>>>> and MMIO resources to IOMMUFD, so VFIO never cares about them.
>>>>>
>>>>> [1] https://lore.kernel.org/all/20250218111017.491719-15-aik@amd.com/
>>>>
>>>> There is also the PCI layer involved here and maybe PCI should be
>>>> participating in managing some of this. Like it makes a bit of sense
>>>> that PCI would block the FLR on platforms that require this?
>>>
>>> FLR to a bound device is absolutely fine, just break the CC state.
>>> Sometimes it is exactly what host need to stop CC immediately.
>>> The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
>>> PCI core.
>>
>> What is a problem here exactly?
>> FLR by the host which equals to any other PCI error? The guest may or may not be able to handle it, afaik it does not handle any errors now, QEMU just stops the guest.
>
> It is about TDX Connect.
>
> According to the dmabuf patchset, the dmabuf needs to be revoked before
> FLR. That means KVM unmaps MMIOs while the device is in the LOCKED/RUN
> state. That is forbidden by the TDX Module and will crash KVM.
FLR is something you tell the device to do; how/why would TDX know about it? Or does it check the TDI state on every map/unmap (unlikely)?
> So the safer way is
> to unbind the TDI first, then revoke MMIOs, then do FLR.
>
> I'm not sure whether AMD will have the same issue when p2p DMA is involved.
On AMD, the host can "revoke" at any time, at worst it'll see RMP events from IOMMU. Thanks,
> Because in that case, MMIOs would also be mapped in the IOMMU PT, and
> revoking MMIOs means dropping the IOMMU mapping. The root cause of the
> concern is that secure firmware should monitor IOMMU mapping integrity
> for private assignment, or the hypervisor could silently drop trusted
> DMA writes.
>
> TDX Connect has a wider impact on this issue because it uses the same
> table for the KVM S-EPT and the Secure IOMMU PT.
>
> Thanks,
> Yilun
>
>> Or FLR by the guest? Then it knows it needs to do the dance with attest/accept, again.
>>
>> Thanks,
>>
>>>
>>> Thanks,
>>> Yilun
>>>
>>>>
>>>> Jason
>>
>> --
>> Alexey
>>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-26 7:18 ` Alexey Kardashevskiy
@ 2025-05-29 14:41 ` Xu Yilun
2025-05-29 16:29 ` Jason Gunthorpe
2025-05-30 2:29 ` Alexey Kardashevskiy
0 siblings, 2 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-29 14:41 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
> > > >
> > > > FLR to a bound device is absolutely fine, just break the CC state.
> > > > Sometimes it is exactly what host need to stop CC immediately.
> > > > The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
> > > > PCI core.
> > >
> > > What is a problem here exactly?
> > > FLR by the host which equals to any other PCI error? The guest may or may not be able to handle it, afaik it does not handle any errors now, QEMU just stops the guest.
> >
> > It is about TDX Connect.
> >
> > According to the dmabuf patchset, the dmabuf needs to be revoked before
> > FLR. That means KVM unmaps MMIOs when the device is in LOCKED/RUN state.
> > That is forbidden by TDX Module and will crash KVM.
>
>
> FLR is something you tell the device to do; how/why would TDX know about it?
I'm talking about the FLR flow in the VFIO driver. The VFIO driver would
zap BARs before FLR, and the zapping would trigger KVM to unmap MMIOs. See
vfio_pci_zap_bars() for the legacy case, and see [1] for the dmabuf case.
[1] https://lore.kernel.org/kvm/20250307052248.405803-4-vivek.kasireddy@intel.com/
A pure FLR without zapping BARs is absolutely OK.
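For reference, the dmabuf revoke in [1] flows through dma-buf move
notification; the importer side could look roughly like this (a sketch
under assumptions: kvm_dmabuf_region and kvm_unmap_mmio_range() are
hypothetical names, while dma_buf_attach_ops and importer_priv are real
dma-buf interfaces):

  /* Sketch: KVM as a dma-buf importer reacting to a revoke. */
  struct kvm_dmabuf_region {
          struct kvm *kvm;
          gpa_t gpa;
          size_t size;
  };

  static void kvm_dmabuf_move_notify(struct dma_buf_attachment *attach)
  {
          struct kvm_dmabuf_region *region = attach->importer_priv;

          /* The exporter (VFIO) is zapping BARs: drop the MMIO mapping. */
          kvm_unmap_mmio_range(region->kvm, region->gpa, region->size);
  }

  static const struct dma_buf_attach_ops kvm_dmabuf_attach_ops = {
          .allow_peer2peer = true,
          .move_notify = kvm_dmabuf_move_notify,
  };

With TDX Connect, the unmap in move_notify is exactly the step that must
not happen while the TDI is still in LOCKED/RUN.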
> Or does it check the TDI state on every map/unmap (unlikely)?
Yeah, TDX Module would check TDI state on every unmapping.
>
>
> > So the safer way is
> > to unbind the TDI first, then revoke MMIOs, then do FLR.
> >
> > I'm not sure when p2p dma is involved AMD will have the same issue.
>
> On AMD, the host can "revoke" at any time, at worst it'll see RMP events from IOMMU. Thanks,
Is the RMP event first detected by the host or the guest? If by the host,
the host could fool the guest by simply suppressing the event. The guest
would think the DMA write succeeded when it did not, which may cause a
security issue.
Thanks,
Yilun
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-29 14:41 ` Xu Yilun
@ 2025-05-29 16:29 ` Jason Gunthorpe
2025-05-30 16:07 ` Xu Yilun
2025-05-30 2:29 ` Alexey Kardashevskiy
1 sibling, 1 reply; 134+ messages in thread
From: Jason Gunthorpe @ 2025-05-29 16:29 UTC (permalink / raw)
To: Xu Yilun
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Thu, May 29, 2025 at 10:41:15PM +0800, Xu Yilun wrote:
> > On AMD, the host can "revoke" at any time, at worst it'll see RMP
> > events from IOMMU. Thanks,
>
> Is the RMP event first detected by the host or the guest? If by the
> host, the host could fool the guest by simply suppressing the event.
> The guest would think the DMA write succeeded when it did not, which
> may cause a security issue.
Is that in scope of the threat model though? The host must not be able to
change DMAs or target them to different memory, but the host can stop
DMA and lose it, surely?
Host controls the PCI memory enable bit, doesn't it?
Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-29 16:29 ` Jason Gunthorpe
@ 2025-05-30 16:07 ` Xu Yilun
0 siblings, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-05-30 16:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Thu, May 29, 2025 at 01:29:23PM -0300, Jason Gunthorpe wrote:
> On Thu, May 29, 2025 at 10:41:15PM +0800, Xu Yilun wrote:
>
> > > On AMD, the host can "revoke" at any time, at worst it'll see RMP
> > > events from IOMMU. Thanks,
> >
> > Is the RMP event firstly detected by host or guest? If by host,
> > host could fool guest by just suppress the event. Guest thought the
> > DMA writting is successful but it is not and may cause security issue.
>
> Is that in scope of the threat model though? The host must not be able
> to change DMAs or target them to different memory, but the host can
> stop DMA and lose it, surely?
This is within the threat model; it is a data integrity issue, not a
DoS issue. If the secure firmware doesn't care, then no component within
the TCB could be aware of the data loss.
>
> Host controls the PCI memory enable bit, doesn't it?
That's why the DSM should fall back the device to CONFIG_UNLOCKED when
memory enable is toggled; that makes the TD/TDI aware of the problem. But
for IOMMU PT blocking, the DSM cannot be aware, so the TSM must do
something.
Zhi helped find something in the SEV-TIO Firmware Interface SPEC,
Section 2.11, which seems to indicate SEV does do something about this:
"If a bound TDI sends a request to the root complex, and the IOMMU detects a fault caused by host
configuration, the root complex fences the ASID from all further I/O to or from that guest. A host
fault is either a host page table fault or an RMP check violation. ASID fencing means that the
IOMMU blocks all further I/O from the root complex to the guest that the TDI was bound, and the
root complex blocks all MMIO accesses by the guest. When a guest writes to MMIO, the write is
silently dropped. When a guest reads from MMIO, the guest reads 1s."
Blocking all TDIs should definitely be avoided. Now I'm more convinced
that Unbind before DMABUF revoke is necessary.
Thanks,
Yilun
>
> Jason
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-29 14:41 ` Xu Yilun
2025-05-29 16:29 ` Jason Gunthorpe
@ 2025-05-30 2:29 ` Alexey Kardashevskiy
2025-05-30 16:23 ` Xu Yilun
1 sibling, 1 reply; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-30 2:29 UTC (permalink / raw)
To: Xu Yilun
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 30/5/25 00:41, Xu Yilun wrote:
>>>>>
>>>>> FLR to a bound device is absolutely fine, just break the CC state.
>>>>> Sometimes it is exactly what host need to stop CC immediately.
>>>>> The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
>>>>> PCI core.
>>>>
>>>> What is a problem here exactly?
>>>> FLR by the host which equals to any other PCI error? The guest may or may not be able to handle it, afaik it does not handle any errors now, QEMU just stops the guest.
>>>
>>> It is about TDX Connect.
>>>
>>> According to the dmabuf patchset, the dmabuf needs to be revoked before
>>> FLR. That means KVM unmaps MMIOs when the device is in LOCKED/RUN state.
>>> That is forbidden by TDX Module and will crash KVM.
>>
>>
>> FLR is something you tell the device to do, how/why would TDX know about it?
>
> I'm talking about the FLR flow in the VFIO driver. The VFIO driver would
> zap BARs before FLR, and the zapping would trigger KVM to unmap MMIOs. See
> vfio_pci_zap_bars() for the legacy case, and see [1] for the dmabuf case.
oh I did not know that we do this zapping, thanks for the pointer.
> [1] https://lore.kernel.org/kvm/20250307052248.405803-4-vivek.kasireddy@intel.com/
>
> A pure FLR without zapping BARs is absolutely OK.
>
>> Or it check the TDI state on every map/unmap (unlikely)?
>
> Yeah, TDX Module would check TDI state on every unmapping.
_every_? Reading the state from the DOE mailbox is not cheap enough (imho) to do on every unmap.
>>
>>> So the safer way is
>>> to unbind the TDI first, then revoke MMIOs, then do FLR.
>>>
>>> I'm not sure when p2p dma is involved AMD will have the same issue.
>>
>> On AMD, the host can "revoke" at any time, at worst it'll see RMP events from IOMMU. Thanks,
>
> Is the RMP event first detected by the host or the guest? If by the host,
Host.
> the host could fool the guest by simply suppressing the event. The guest
> would think the DMA write succeeded when it did not, which may cause a
> security issue.
An RMP event on the host is an indication that the RMP check has failed and the DMA to the guest did not complete, so the guest won't see new data. Same as other PCI errors really. RMP acts like a firewall; things behind it do not need to know if something was dropped. Thanks,
>
> Thanks,
> Yilun
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-30 2:29 ` Alexey Kardashevskiy
@ 2025-05-30 16:23 ` Xu Yilun
2025-06-10 4:20 ` Alexey Kardashevskiy
0 siblings, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-30 16:23 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Fri, May 30, 2025 at 12:29:30PM +1000, Alexey Kardashevskiy wrote:
>
>
> On 30/5/25 00:41, Xu Yilun wrote:
> > > > > >
> > > > > > FLR to a bound device is absolutely fine, just break the CC state.
> > > > > > Sometimes it is exactly what host need to stop CC immediately.
> > > > > > The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
> > > > > > PCI core.
> > > > >
> > > > > What is a problem here exactly?
> > > > > FLR by the host which equals to any other PCI error? The guest may or may not be able to handle it, afaik it does not handle any errors now, QEMU just stops the guest.
> > > >
> > > > It is about TDX Connect.
> > > >
> > > > According to the dmabuf patchset, the dmabuf needs to be revoked before
> > > > FLR. That means KVM unmaps MMIOs when the device is in LOCKED/RUN state.
> > > > That is forbidden by TDX Module and will crash KVM.
> > >
> > >
> > > FLR is something you tell the device to do, how/why would TDX know about it?
> >
> > I'm talking about FLR in VFIO driver. The VFIO driver would zap bar
> > before FLR. The zapping would trigger KVM unmap MMIOs. See
> > vfio_pci_zap_bars() for legacy case, and see [1] for dmabuf case.
>
> oh I did not know that we do this zapping, thanks for the pointer.
> > [1] https://lore.kernel.org/kvm/20250307052248.405803-4-vivek.kasireddy@intel.com/
> >
> > A pure FLR without zapping bar is absolutely OK.
> >
> > > Or it check the TDI state on every map/unmap (unlikely)?
> >
> > Yeah, TDX Module would check TDI state on every unmapping.
>
> _every_? Reading the state from DOE mailbox is not cheap enough (imho) to do on every unmap.
Sorry for the confusion. The TDX firmware just checks whether the STOP TDI
firmware call has been executed; it will not check the real device state
via DOE. That means even if the device has physically exited to UNLOCKED,
the TDX host should still issue the STOP TDI fwcall first, then the MMIO
unmap.
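In other words, the host-side contract looks roughly like this (purely
illustrative: tdx_tdi_stopped() and tdh_mmio_unmap() are made-up names,
not real SEAMCALL wrappers):

  /* Unmapping private MMIO is only legal once the TDI has been stopped. */
  static int tdx_unmap_private_mmio(struct kvm *kvm, gpa_t gpa, size_t size)
  {
          /*
           * The TDX Module only tracks whether the STOP TDI fwcall was
           * issued; it does not probe the real device state via DOE.
           */
          if (!tdx_tdi_stopped(kvm))
                  return -EBUSY;  /* unmapping while bound would crash KVM */

          return tdh_mmio_unmap(kvm, gpa, size);
  }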
>
> > >
> > > > So the safer way is
> > > > to unbind the TDI first, then revoke MMIOs, then do FLR.
> > > >
> > > > I'm not sure when p2p dma is involved AMD will have the same issue.
> > >
> > > On AMD, the host can "revoke" at any time, at worst it'll see RMP events from IOMMU. Thanks,
> >
> > Is the RMP event firstly detected by host or guest? If by host,
>
> Host.
>
> > host could fool guest by just suppress the event. Guest thought the
> > DMA writting is successful but it is not and may cause security issue.
>
> An RMP event on the host is an indication that the RMP check has failed and the DMA to the guest did not complete, so the guest won't see new data. Same as other PCI errors really. RMP acts like a firewall; things behind it do not need to know if something was dropped. Thanks,
Not really; the guest thinks the data has changed but it actually hasn't,
i.e. data integrity is broken.
Also please help check if the following relates to this issue:
SEV-TIO Firmware Interface SPEC, Section 2.11
If a bound TDI sends a request to the root complex, and the IOMMU detects a fault caused by host
configuration, the root complex fences the ASID from all further I/O to or from that guest. A host
fault is either a host page table fault or an RMP check violation. ASID fencing means that the
IOMMU blocks all further I/O from the root complex to the guest that the TDI was bound, and the
root complex blocks all MMIO accesses by the guest. When a guest writes to MMIO, the write is
silently dropped. When a guest reads from MMIO, the guest reads 1s.
Thanks,
Yilun
>
> >
> > Thanks,
> > Yilun
>
> --
> Alexey
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-30 16:23 ` Xu Yilun
@ 2025-06-10 4:20 ` Alexey Kardashevskiy
2025-06-10 5:19 ` Baolu Lu
2025-06-10 6:53 ` Xu Yilun
0 siblings, 2 replies; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-10 4:20 UTC (permalink / raw)
To: Xu Yilun
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 31/5/25 02:23, Xu Yilun wrote:
> On Fri, May 30, 2025 at 12:29:30PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 30/5/25 00:41, Xu Yilun wrote:
>>>>>>>
>>>>>>> FLR to a bound device is absolutely fine, just break the CC state.
>>>>>>> Sometimes it is exactly what host need to stop CC immediately.
>>>>>>> The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
>>>>>>> PCI core.
>>>>>>
>>>>>> What is a problem here exactly?
>>>>>> FLR by the host which equals to any other PCI error? The guest may or may not be able to handle it, afaik it does not handle any errors now, QEMU just stops the guest.
>>>>>
>>>>> It is about TDX Connect.
>>>>>
>>>>> According to the dmabuf patchset, the dmabuf needs to be revoked before
>>>>> FLR. That means KVM unmaps MMIOs when the device is in LOCKED/RUN state.
>>>>> That is forbidden by TDX Module and will crash KVM.
>>>>
>>>>
>>>> FLR is something you tell the device to do, how/why would TDX know about it?
>>>
>>> I'm talking about FLR in VFIO driver. The VFIO driver would zap bar
>>> before FLR. The zapping would trigger KVM unmap MMIOs. See
>>> vfio_pci_zap_bars() for legacy case, and see [1] for dmabuf case.
>>
>> oh I did not know that we do this zapping, thanks for the pointer.
>>> [1] https://lore.kernel.org/kvm/20250307052248.405803-4-vivek.kasireddy@intel.com/
>>>
>>> A pure FLR without zapping bar is absolutely OK.
>>>
>>>> Or it check the TDI state on every map/unmap (unlikely)?
>>>
>>> Yeah, TDX Module would check TDI state on every unmapping.
>>
>> _every_? Reading the state from DOE mailbox is not cheap enough (imho) to do on every unmap.
>
> Sorry for the confusion. The TDX firmware just checks whether the STOP TDI
> firmware call has been executed; it will not check the real device state
> via DOE. That means even if the device has physically exited to UNLOCKED,
> the TDX host should still issue the STOP TDI fwcall first, then the MMIO
> unmap.
>
>>
>>>>
>>>>> So the safer way is
>>>>> to unbind the TDI first, then revoke MMIOs, then do FLR.
>>>>>
>>>>> I'm not sure when p2p dma is involved AMD will have the same issue.
>>>>
>>>> On AMD, the host can "revoke" at any time, at worst it'll see RMP events from IOMMU. Thanks,
>>>
>>> Is the RMP event firstly detected by host or guest? If by host,
>>
>> Host.
>>
>>> host could fool guest by just suppress the event. Guest thought the
>>> DMA writting is successful but it is not and may cause security issue.
>>
>> An RMP event on the host is an indication that RMP check has failed and DMA to the guest did not complete so the guest won't see new data. Same as other PCI errors really. RMP acts like a firewall, things behind it do not need to know if something was dropped. Thanks,
>
> Not really; the guest thinks the data has changed but it actually hasn't,
> i.e. data integrity is broken.
I am not following, sorry. Integrity is broken when something untrusted (== other than the SNP guest and the trusted device) manages to write to the guest encrypted memory successfully. If nothing is written - the guest can easily see this and do... nothing? Devices have bugs or spurious interrupts happen, the guest driver should be able to cope with that.
> Also please help check if the following relates to this issue:
>
> SEV-TIO Firmware Interface SPEC, Section 2.11
>
> If a bound TDI sends a request to the root complex, and the IOMMU detects a fault caused by host
> configuration, the root complex fences the ASID from all further I/O to or from that guest. A host
> fault is either a host page table fault or an RMP check violation. ASID fencing means that the
> IOMMU blocks all further I/O from the root complex to the guest that the TDI was bound, and the
> root complex blocks all MMIO accesses by the guest. When a guest writes to MMIO, the write is
> silently dropped. When a guest reads from MMIO, the guest reads 1s.
Right, this is about not letting bad data through, i.e. integrity. Thanks,
>
> Thanks,
> Yilun
>
>>
>>>
>>> Thanks,
>>> Yilun
>>
>> --
>> Alexey
>>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-06-10 4:20 ` Alexey Kardashevskiy
@ 2025-06-10 5:19 ` Baolu Lu
2025-06-10 6:53 ` Xu Yilun
1 sibling, 0 replies; 134+ messages in thread
From: Baolu Lu @ 2025-06-10 5:19 UTC (permalink / raw)
To: Alexey Kardashevskiy, Xu Yilun
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon,
zhenzhong.duan, tao1.su
On 6/10/25 12:20, Alexey Kardashevskiy wrote:
>
>
> On 31/5/25 02:23, Xu Yilun wrote:
>> On Fri, May 30, 2025 at 12:29:30PM +1000, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 30/5/25 00:41, Xu Yilun wrote:
>>>>>>>>
>>>>>>>> FLR to a bound device is absolutely fine, just break the CC state.
>>>>>>>> Sometimes it is exactly what host need to stop CC immediately.
>>>>>>>> The problem is in VFIO's pre-FLR handling so we need to patch
>>>>>>>> VFIO, not
>>>>>>>> PCI core.
>>>>>>>
>>>>>>> What is a problem here exactly?
>>>>>>> FLR by the host which equals to any other PCI error? The guest
>>>>>>> may or may not be able to handle it, afaik it does not handle any
>>>>>>> errors now, QEMU just stops the guest.
>>>>>>
>>>>>> It is about TDX Connect.
>>>>>>
>>>>>> According to the dmabuf patchset, the dmabuf needs to be revoked
>>>>>> before
>>>>>> FLR. That means KVM unmaps MMIOs when the device is in LOCKED/RUN
>>>>>> state.
>>>>>> That is forbidden by TDX Module and will crash KVM.
>>>>>
>>>>>
>>>>> FLR is something you tell the device to do, how/why would TDX know
>>>>> about it?
>>>>
>>>> I'm talking about FLR in VFIO driver. The VFIO driver would zap bar
>>>> before FLR. The zapping would trigger KVM unmap MMIOs. See
>>>> vfio_pci_zap_bars() for legacy case, and see [1] for dmabuf case.
>>>
>>> oh I did not know that we do this zapping, thanks for the pointer.
>>>> [1] https://lore.kernel.org/kvm/20250307052248.405803-4-
>>>> vivek.kasireddy@intel.com/
>>>>
>>>> A pure FLR without zapping bar is absolutely OK.
>>>>
>>>>> Or it check the TDI state on every map/unmap (unlikely)?
>>>>
>>>> Yeah, TDX Module would check TDI state on every unmapping.
>>>
>>> _every_? Reading the state from DOE mailbox is not cheap enough
>>> (imho) to do on every unmap.
>>
>> Sorry for confusing. TDX firmware just checks if STOP TDI firmware call
>> is executed, will not check the real device state via DOE. That means
>> even if device has physically exited to UNLOCKED, TDX host should still
>> call STOP TDI fwcall first, then MMIO unmap.
>>
>>>
>>>>>
>>>>>> So the safer way is
>>>>>> to unbind the TDI first, then revoke MMIOs, then do FLR.
>>>>>>
>>>>>> I'm not sure when p2p dma is involved AMD will have the same issue.
>>>>>
>>>>> On AMD, the host can "revoke" at any time, at worst it'll see RMP
>>>>> events from IOMMU. Thanks,
>>>>
>>>> Is the RMP event firstly detected by host or guest? If by host,
>>>
>>> Host.
>>>
>>>> host could fool guest by just suppress the event. Guest thought the
>>>> DMA writting is successful but it is not and may cause security issue.
>>>
>>> An RMP event on the host is an indication that RMP check has failed
>>> and DMA to the guest did not complete so the guest won't see new
>>> data. Same as other PCI errors really. RMP acts like a firewall,
>>> things behind it do not need to know if something was dropped. Thanks,
>>
>> Not really, guest thought the data is changed but it actually doesn't.
>> I.e. data integrity is broken.
>
> I am not following, sorry. Integrity is broken when something untrusted
> (== other than the SNP guest and the trusted device) manages to write to
> the guest encrypted memory successfully. If nothing is written - the
> guest can easily see this and do... nothing? Devices have bugs or
> spurious interrupts happen, the guest driver should be able to cope with
> that.
Data integrity might not be the most accurate way to describe the
situation here. If I understand correctly, the MMIO mapping was
destroyed before the device was unbound (meaning the guest still sees
the device). When the guest issues a P2P write to the device's MMIO, it
will definitely fail, but the guest won't be aware of this failure.
Imagine this on a bare-metal system: if a P2P access targets a device's
MMIO but the device or platform considers it an illegal access, there
should be a bus error or machine check exception. Alternatively, if the
device supports out-of-band AER, the AER driver should then catch and
process these errors.
Therefore, unbinding the device before MMIO invalidation could generally
avoid this.
Thanks,
baolu
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-06-10 4:20 ` Alexey Kardashevskiy
2025-06-10 5:19 ` Baolu Lu
@ 2025-06-10 6:53 ` Xu Yilun
1 sibling, 0 replies; 134+ messages in thread
From: Xu Yilun @ 2025-06-10 6:53 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Tue, Jun 10, 2025 at 02:20:03PM +1000, Alexey Kardashevskiy wrote:
>
>
> On 31/5/25 02:23, Xu Yilun wrote:
> > On Fri, May 30, 2025 at 12:29:30PM +1000, Alexey Kardashevskiy wrote:
> > >
> > >
> > > On 30/5/25 00:41, Xu Yilun wrote:
> > > > > > > >
> > > > > > > > FLR to a bound device is absolutely fine, just break the CC state.
> > > > > > > > Sometimes it is exactly what host need to stop CC immediately.
> > > > > > > > The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
> > > > > > > > PCI core.
> > > > > > >
> > > > > > > What is a problem here exactly?
> > > > > > > FLR by the host which equals to any other PCI error? The guest may or may not be able to handle it, afaik it does not handle any errors now, QEMU just stops the guest.
> > > > > >
> > > > > > It is about TDX Connect.
> > > > > >
> > > > > > According to the dmabuf patchset, the dmabuf needs to be revoked before
> > > > > > FLR. That means KVM unmaps MMIOs when the device is in LOCKED/RUN state.
> > > > > > That is forbidden by TDX Module and will crash KVM.
> > > > >
> > > > >
> > > > > FLR is something you tell the device to do, how/why would TDX know about it?
> > > >
> > > > I'm talking about FLR in VFIO driver. The VFIO driver would zap bar
> > > > before FLR. The zapping would trigger KVM unmap MMIOs. See
> > > > vfio_pci_zap_bars() for legacy case, and see [1] for dmabuf case.
> > >
> > > oh I did not know that we do this zapping, thanks for the pointer.
> > > > [1] https://lore.kernel.org/kvm/20250307052248.405803-4-vivek.kasireddy@intel.com/
> > > >
> > > > A pure FLR without zapping bar is absolutely OK.
> > > >
> > > > > Or it check the TDI state on every map/unmap (unlikely)?
> > > >
> > > > Yeah, TDX Module would check TDI state on every unmapping.
> > >
> > > _every_? Reading the state from DOE mailbox is not cheap enough (imho) to do on every unmap.
> >
> > Sorry for confusing. TDX firmware just checks if STOP TDI firmware call
> > is executed, will not check the real device state via DOE. That means
> > even if device has physically exited to UNLOCKED, TDX host should still
> > call STOP TDI fwcall first, then MMIO unmap.
> >
> > >
> > > > >
> > > > > > So the safer way is
> > > > > > to unbind the TDI first, then revoke MMIOs, then do FLR.
> > > > > >
> > > > > > I'm not sure when p2p dma is involved AMD will have the same issue.
> > > > >
> > > > > On AMD, the host can "revoke" at any time, at worst it'll see RMP events from IOMMU. Thanks,
> > > >
> > > > Is the RMP event firstly detected by host or guest? If by host,
> > >
> > > Host.
> > >
> > > > host could fool guest by just suppress the event. Guest thought the
> > > > DMA writting is successful but it is not and may cause security issue.
> > >
> > > An RMP event on the host is an indication that RMP check has failed and DMA to the guest did not complete so the guest won't see new data. Same as other PCI errors really. RMP acts like a firewall, things behind it do not need to know if something was dropped. Thanks,
> >
> > Not really, guest thought the data is changed but it actually doesn't.
> > I.e. data integrity is broken.
>
> I am not following, sorry. Integrity is broken when something untrusted (== other than the SNP guest and the trusted device) manages to write to the guest encrypted memory successfully.
Integrity is also broken when the guest thinks the content at some addr
has been changed to A but it actually stays B.
> If nothing is written - the guest can easily see this and do... nothing?
The guest may not see this via an RMP event or IOMMU fault alone; a
malicious host could suppress these events. Yes, the guest may later read
the addr and see the trick, but this cannot be ensured. There is no
general contract saying SW must read back an addr to verify a DMA write
succeeded.
And DMA to MMIO is a worse case than DMA to memory: SW cannot even read
back the content since MMIO registers may be Write Only.
So you need ASID fencing to make the guest easily see the DMA Silent
Drop. Intel & ARM also have their own ways.
The purpose here is to reach a consensus that a benign VMM should avoid
triggering these DMA Silent Drop protections, by doing "unbind TDI first,
then invalidate MMIO".
Thanks,
Yilun
> Devices have bugs or spurious interrupts happen, the guest driver should be able to cope with that.
> > Also please help check if the following relates to this issue:
> >
> > SEV-TIO Firmware Interface SPEC, Section 2.11
> >
> > If a bound TDI sends a request to the root complex, and the IOMMU detects a fault caused by host
> > configuration, the root complex fences the ASID from all further I/O to or from that guest. A host
> > fault is either a host page table fault or an RMP check violation. ASID fencing means that the
> > IOMMU blocks all further I/O from the root complex to the guest that the TDI was bound, and the
> > root complex blocks all MMIO accesses by the guest. When a guest writes to MMIO, the write is
> > silently dropped. When a guest reads from MMIO, the guest reads 1s.
>
> Right, this is about not letting bad data through, i.e. integrity. Thanks,
>
> >
> > Thanks,
> > Yilun
> >
> > >
> > > >
> > > > Thanks,
> > > > Yilun
> > >
> > > --
> > > Alexey
> > >
>
> --
> Alexey
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-12 9:30 ` Alexey Kardashevskiy
2025-05-12 14:06 ` Jason Gunthorpe
@ 2025-05-14 3:20 ` Xu Yilun
2025-06-10 4:37 ` Alexey Kardashevskiy
1 sibling, 1 reply; 134+ messages in thread
From: Xu Yilun @ 2025-05-14 3:20 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>
>
> On 10/5/25 13:47, Xu Yilun wrote:
> > On Fri, May 09, 2025 at 03:43:18PM -0300, Jason Gunthorpe wrote:
> > > On Sat, May 10, 2025 at 12:28:48AM +0800, Xu Yilun wrote:
> > > > On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
> > > > > On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
> > > > > > Ping?
> > > > >
> > > > > Sorry for late reply from vacation.
> > > > >
> > > > > > Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
> > > > >
> > > > > As disscussed in the thread, this kAPI is not well considered but IIUC
> > > > > the concept of "importer mapping" is still valid. We need more
> > > > > investigation about all the needs - P2P, CC memory, private bus
> > > > > channel, and work out a formal API.
> > > > >
> > > > > However in last few months I'm focusing on high level TIO flow - TSM
> > > > > framework, IOMMUFD based bind/unbind, so no much progress here and is
> > > > > still using this temporary kAPI. But as long as "importer mapping" is
> > > > > alive, the dmabuf fd for KVM is still valid and we could enable TIO
> > > > > based on that.
> > > >
> > > > Oh I forgot to mention I moved the dmabuf creation from VFIO to IOMMUFD
> > > > recently, the IOCTL is against iommufd_device.
> > >
> > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> > > just about managing the translation control of the device.
> >
> > I have a little difficulty to understand. Is TSM bind PCI stuff? To me
> > it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> > TDISP LOCKED state, so that device behaves differently from before. Then
> > why put it in IOMMUFD?
>
>
> "TSM bind" sets up the CPU side of it, it binds a VM to a piece of IOMMU on the host CPU.
I didn't fully get your idea; are you arguing that "TSM bind is NOT PCI
stuff"? To me that is not true.
TSM bind also sets up the device side. From your patch, it calls
tsm_tdi_bind(), which in turn calls spdm_forward(); I assume it is doing
TDISP LOCK. And TDISP LOCK changes the device a lot.
> The device does not know about the VM, it just enables/disables encryption by a request from the CPU (those start/stop interface commands).
> And IOMMUFD won't be doing DOE, the platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
IOMMUFD calls tsm_tdi_bind(), which is an interface doing PCI stuff.
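To make the shape of that concrete, the bind path under discussion is
roughly the following (a sketch only; the vdevice fields, tsm_tdi_get()
and the tsm_tdi_bind() arguments are illustrative and may differ from the
posted RFC):

  /* ioctl(iommufd, IOMMU_VDEVICE_TSM_BIND) ends up here. */
  static int iommufd_vdevice_tsm_bind(struct iommufd_vdevice *vdev)
  {
          struct tsm_tdi *tdi = tsm_tdi_get(vdev->dev);   /* physical TDI */

          if (!tdi)
                  return -ENODEV;

          /*
           * Forwards SPDM/TDISP messages via the platform TSM driver
           * (e.g. AMD CCP) and leaves the device in CONFIG_LOCKED, i.e.
           * it changes physical device state, not just translation.
           */
          return tsm_tdi_bind(tdi);
  }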
Thanks,
Yilun
>
> We probably should notify VFIO about the state transition but I do not know what VFIO would want to do in response.
>
>
^ permalink raw reply [flat|nested] 134+ messages in thread
* Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev
2025-05-14 3:20 ` Xu Yilun
@ 2025-06-10 4:37 ` Alexey Kardashevskiy
0 siblings, 0 replies; 134+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-10 4:37 UTC (permalink / raw)
To: Xu Yilun
Cc: Jason Gunthorpe, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
On 14/5/25 13:20, Xu Yilun wrote:
> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 10/5/25 13:47, Xu Yilun wrote:
>>> On Fri, May 09, 2025 at 03:43:18PM -0300, Jason Gunthorpe wrote:
>>>> On Sat, May 10, 2025 at 12:28:48AM +0800, Xu Yilun wrote:
>>>>> On Fri, May 09, 2025 at 07:12:46PM +0800, Xu Yilun wrote:
>>>>>> On Fri, May 09, 2025 at 01:04:58PM +1000, Alexey Kardashevskiy wrote:
>>>>>>> Ping?
>>>>>>
>>>>>> Sorry for late reply from vacation.
>>>>>>
>>>>>>> Also, since there is pushback on 01/12 "dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI", what is the plan now? Thanks,
>>>>>>
>>>>>> As disscussed in the thread, this kAPI is not well considered but IIUC
>>>>>> the concept of "importer mapping" is still valid. We need more
>>>>>> investigation about all the needs - P2P, CC memory, private bus
>>>>>> channel, and work out a formal API.
>>>>>>
>>>>>> However in last few months I'm focusing on high level TIO flow - TSM
>>>>>> framework, IOMMUFD based bind/unbind, so no much progress here and is
>>>>>> still using this temporary kAPI. But as long as "importer mapping" is
>>>>>> alive, the dmabuf fd for KVM is still valid and we could enable TIO
>>>>>> based on that.
>>>>>
>>>>> Oh I forgot to mention I moved the dmabuf creation from VFIO to IOMMUFD
>>>>> recently, the IOCTL is against iommufd_device.
>>>>
>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
>>>> just about managing the translation control of the device.
>>>
>>> I have a little difficulty to understand. Is TSM bind PCI stuff? To me
>>> it is. Host sends PCI TDISP messages via PCI DOE to put the device in
>>> TDISP LOCKED state, so that device behaves differently from before. Then
>>> why put it in IOMMUFD?
>>
>>
>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece of IOMMU on the host CPU.
>
> I didn't fully get your idea; are you arguing that "TSM bind is NOT PCI
> stuff"? To me that is not true.
It is more IOMMU stuff than PCI, and for the PCI part VFIO has nothing to add to this.
> TSM bind also sets up the device side. From your patch, it calls
> tsm_tdi_bind(), which in turn calls spdm_forward(); I assume it is doing
> TDISP LOCK. And TDISP LOCK changes the device a lot.
DMA runs, MMIO works, so what is that "lot"? Config space access works a bit differently, but it traps into QEMU anyway and QEMU already knows about all this binding business and can act accordingly.
>> The device does not know about the VM, it just enables/disables encryption by a request from the CPU (those start/stop interface commands).
>> And IOMMUFD won't be doing DOE, the platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
>
> IOMMUFD calls tsm_tdi_bind(), which is an interface doing PCI stuff.
It only forwards messages; there is no state change in page tables or anywhere in the host kernel really. Thanks,
ps. hard to follow a million (sub)threads but I am trying, sorry for the delays :)
>
> Thanks,
> Yilun
>
>>
>> We probably should notify VFIO about the state transition but I do not know VFIO would want to do in response.
>>
>>
--
Alexey
^ permalink raw reply [flat|nested] 134+ messages in thread