* [PATCH 0/2] vfio/pci: vfio device address space mapping
@ 2024-05-23 19:56 Alex Williamson
2024-05-23 19:56 ` [PATCH 1/2] vfio: Create vfio_fs_type with inode per device Alex Williamson
2024-05-23 19:56 ` [PATCH 2/2] vfio/pci: Use unmap_mapping_range() Alex Williamson
0 siblings, 2 replies; 21+ messages in thread
From: Alex Williamson @ 2024-05-23 19:56 UTC (permalink / raw)
To: kvm; +Cc: Alex Williamson, ajones, yan.y.zhao, kevin.tian, jgg, peterx
Upstream commit ba168b52bf8e ("mm: use rwsem assertion macros for
mmap_lock") turns a long-standing lockdep issue, where we call
io_remap_pfn_range() from within the vm_ops fault handler callback
without the proper write lock[1], into a WARN_ON that we can no
longer stall on fixing.
Attaching an address space to the vfio device file has been discussed
for some time as a way to make use of unmap_mapping_range(), which
provides an easy mechanism for zapping all vmas mapping a section of
the device file, for example mmaps to PCI BARs. This means that we
no longer need to maintain a list of those vmas for the purpose of
zapping, which removes a bunch of really ugly locking. The vma list
was also used to avoid duplicate mappings on concurrent faults to the
same vma.
As a result, we now use the more acceptable vmf_insert_pfn(), which,
unlike io_remap_pfn_range(), handles locking correctly when called
from the fault handler.
The unfortunate side effect of this is that we now fault per page
rather than populating the entire vma with a single fault. While
this overhead is fairly insignificant for average BAR sizes, it is
still notable. There's potentially quite ugly code we could use to
walk the vmas in the address space and proactively reinsert mappings
to avoid this, but the simpler solution seems to be to teach
vmf_insert_pfn_{pmd,pud}() about pfnmaps so that we can extend the
faulting behavior to include the vm_ops huge_fault handler, both to
vastly reduce the number of faults and to reduce TLB pressure.
The above commit seems to require an iterative solution: introduce
the address space, remove the vma tracking, and make use of
vmf_insert_pfn() in the short term, then work on the mm aspects to
enable huge_fault in the long term.
This series is intended for v6.10 given the WARN_ON now encountered
for all vfio-pci uses. Thanks,
Alex
[1]https://lore.kernel.org/all/20230508125842.28193-1-yan.y.zhao@intel.com/
Alex Williamson (2):
vfio: Create vfio_fs_type with inode per device
vfio/pci: Use unmap_mapping_range()
drivers/vfio/device_cdev.c | 7 +
drivers/vfio/group.c | 7 +
drivers/vfio/pci/vfio_pci_core.c | 256 +++++++------------------------
drivers/vfio/vfio_main.c | 44 ++++++
include/linux/vfio.h | 1 +
include/linux/vfio_pci_core.h | 2 -
6 files changed, 115 insertions(+), 202 deletions(-)
--
2.45.0
^ permalink raw reply [flat|nested] 21+ messages in thread* [PATCH 1/2] vfio: Create vfio_fs_type with inode per device 2024-05-23 19:56 [PATCH 0/2] vfio/pci: vfio device address space mapping Alex Williamson @ 2024-05-23 19:56 ` Alex Williamson 2024-05-24 13:24 ` Jason Gunthorpe 2024-05-29 23:59 ` Tian, Kevin 2024-05-23 19:56 ` [PATCH 2/2] vfio/pci: Use unmap_mapping_range() Alex Williamson 1 sibling, 2 replies; 21+ messages in thread From: Alex Williamson @ 2024-05-23 19:56 UTC (permalink / raw) To: kvm; +Cc: Alex Williamson, ajones, yan.y.zhao, kevin.tian, jgg, peterx By linking all the device fds we provide to userspace to an address space through a new pseudo fs, we can use tools like unmap_mapping_range() to zap all vmas associated with a device. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- drivers/vfio/device_cdev.c | 7 ++++++ drivers/vfio/group.c | 7 ++++++ drivers/vfio/vfio_main.c | 44 ++++++++++++++++++++++++++++++++++++++ include/linux/vfio.h | 1 + 4 files changed, 59 insertions(+) diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c index e75da0a70d1f..bb1817bd4ff3 100644 --- a/drivers/vfio/device_cdev.c +++ b/drivers/vfio/device_cdev.c @@ -39,6 +39,13 @@ int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep) filep->private_data = df; + /* + * Use the pseudo fs inode on the device to link all mmaps + * to the same address space, allowing us to unmap all vmas + * associated to this device using unmap_mapping_range(). 
+ */ + filep->f_mapping = device->inode->i_mapping; + return 0; err_put_registration: diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c index 610a429c6191..ded364588d29 100644 --- a/drivers/vfio/group.c +++ b/drivers/vfio/group.c @@ -286,6 +286,13 @@ static struct file *vfio_device_open_file(struct vfio_device *device) */ filep->f_mode |= (FMODE_PREAD | FMODE_PWRITE); + /* + * Use the pseudo fs inode on the device to link all mmaps + * to the same address space, allowing us to unmap all vmas + * associated to this device using unmap_mapping_range(). + */ + filep->f_mapping = device->inode->i_mapping; + if (device->group->type == VFIO_NO_IOMMU) dev_warn(device->dev, "vfio-noiommu device opened by user " "(%s:%d)\n", current->comm, task_pid_nr(current)); diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index e97d796a54fb..a5a62d9d963f 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -22,8 +22,10 @@ #include <linux/list.h> #include <linux/miscdevice.h> #include <linux/module.h> +#include <linux/mount.h> #include <linux/mutex.h> #include <linux/pci.h> +#include <linux/pseudo_fs.h> #include <linux/rwsem.h> #include <linux/sched.h> #include <linux/slab.h> @@ -43,9 +45,13 @@ #define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>" #define DRIVER_DESC "VFIO - User Level meta-driver" +#define VFIO_MAGIC 0x5646494f /* "VFIO" */ + static struct vfio { struct class *device_class; struct ida device_ida; + struct vfsmount *vfs_mount; + int fs_count; } vfio; #ifdef CONFIG_VFIO_NOIOMMU @@ -186,6 +192,8 @@ static void vfio_device_release(struct device *dev) if (device->ops->release) device->ops->release(device); + iput(device->inode); + simple_release_fs(&vfio.vfs_mount, &vfio.fs_count); kvfree(device); } @@ -228,6 +236,34 @@ struct vfio_device *_vfio_alloc_device(size_t size, struct device *dev, } EXPORT_SYMBOL_GPL(_vfio_alloc_device); +static int vfio_fs_init_fs_context(struct fs_context *fc) +{ + return init_pseudo(fc, 
VFIO_MAGIC) ? 0 : -ENOMEM; +} + +static struct file_system_type vfio_fs_type = { + .name = "vfio", + .owner = THIS_MODULE, + .init_fs_context = vfio_fs_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static struct inode *vfio_fs_inode_new(void) +{ + struct inode *inode; + int ret; + + ret = simple_pin_fs(&vfio_fs_type, &vfio.vfs_mount, &vfio.fs_count); + if (ret) + return ERR_PTR(ret); + + inode = alloc_anon_inode(vfio.vfs_mount->mnt_sb); + if (IS_ERR(inode)) + simple_release_fs(&vfio.vfs_mount, &vfio.fs_count); + + return inode; +} + /* * Initialize a vfio_device so it can be registered to vfio core. */ @@ -246,6 +282,11 @@ static int vfio_init_device(struct vfio_device *device, struct device *dev, init_completion(&device->comp); device->dev = dev; device->ops = ops; + device->inode = vfio_fs_inode_new(); + if (IS_ERR(device->inode)) { + ret = PTR_ERR(device->inode); + goto out_inode; + } if (ops->init) { ret = ops->init(device); @@ -260,6 +301,9 @@ static int vfio_init_device(struct vfio_device *device, struct device *dev, return 0; out_uninit: + iput(device->inode); + simple_release_fs(&vfio.vfs_mount, &vfio.fs_count); +out_inode: vfio_release_device_set(device); ida_free(&vfio.device_ida, device->index); return ret; diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 8b1a29820409..000a6cab2d31 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -64,6 +64,7 @@ struct vfio_device { struct completion comp; struct iommufd_access *iommufd_access; void (*put_kvm)(struct kvm *kvm); + struct inode *inode; #if IS_ENABLED(CONFIG_IOMMUFD) struct iommufd_device *iommufd_device; u8 iommufd_attached:1; -- 2.45.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] vfio: Create vfio_fs_type with inode per device 2024-05-23 19:56 ` [PATCH 1/2] vfio: Create vfio_fs_type with inode per device Alex Williamson @ 2024-05-24 13:24 ` Jason Gunthorpe 2024-05-29 23:59 ` Tian, Kevin 1 sibling, 0 replies; 21+ messages in thread From: Jason Gunthorpe @ 2024-05-24 13:24 UTC (permalink / raw) To: Alex Williamson; +Cc: kvm, ajones, yan.y.zhao, kevin.tian, peterx On Thu, May 23, 2024 at 01:56:26PM -0600, Alex Williamson wrote: > By linking all the device fds we provide to userspace to an > address space through a new pseudo fs, we can use tools like > unmap_mapping_range() to zap all vmas associated with a device. > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > drivers/vfio/device_cdev.c | 7 ++++++ > drivers/vfio/group.c | 7 ++++++ > drivers/vfio/vfio_main.c | 44 ++++++++++++++++++++++++++++++++++++++ > include/linux/vfio.h | 1 + > 4 files changed, 59 insertions(+) Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH 1/2] vfio: Create vfio_fs_type with inode per device 2024-05-23 19:56 ` [PATCH 1/2] vfio: Create vfio_fs_type with inode per device Alex Williamson 2024-05-24 13:24 ` Jason Gunthorpe @ 2024-05-29 23:59 ` Tian, Kevin 1 sibling, 0 replies; 21+ messages in thread From: Tian, Kevin @ 2024-05-29 23:59 UTC (permalink / raw) To: Alex Williamson, kvm@vger.kernel.org Cc: ajones@ventanamicro.com, Zhao, Yan Y, jgg@nvidia.com, peterx@redhat.com > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, May 24, 2024 3:56 AM > > By linking all the device fds we provide to userspace to an > address space through a new pseudo fs, we can use tools like > unmap_mapping_range() to zap all vmas associated with a device. > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-23 19:56 [PATCH 0/2] vfio/pci: vfio device address space mapping Alex Williamson 2024-05-23 19:56 ` [PATCH 1/2] vfio: Create vfio_fs_type with inode per device Alex Williamson @ 2024-05-23 19:56 ` Alex Williamson 2024-05-24 0:39 ` Yan Zhao ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Alex Williamson @ 2024-05-23 19:56 UTC (permalink / raw) To: kvm; +Cc: Alex Williamson, ajones, yan.y.zhao, kevin.tian, jgg, peterx With the vfio device fd tied to the address space of the pseudo fs inode, we can use the mm to track all vmas that might be mmap'ing device BARs, which removes our vma_list and all the complicated lock ordering necessary to manually zap each related vma. Note that we can no longer store the pfn in vm_pgoff if we want to use unmap_mapping_range() to zap a selective portion of the device fd corresponding to BAR mappings. This also converts our mmap fault handler to use vmf_insert_pfn() because we no longer have a vma_list to avoid the concurrency problem with io_remap_pfn_range(). The goal is to eventually use the vm_ops huge_fault handler to avoid the additional faulting overhead, but vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. Also, Jason notes that a race exists between unmap_mapping_range() and the fops mmap callback if we were to call io_remap_pfn_range() to populate the vma on mmap. Specifically, mmap_region() does call_mmap() before it does vma_link_file() which gives a window where the vma is populated but invisible to unmap_mapping_range(). 
Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- drivers/vfio/pci/vfio_pci_core.c | 256 +++++++------------------------ include/linux/vfio_pci_core.h | 2 - 2 files changed, 56 insertions(+), 202 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 80cae87fff36..234e23ecaa94 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1610,100 +1610,20 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu } EXPORT_SYMBOL_GPL(vfio_pci_core_write); -/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */ -static int vfio_pci_zap_and_vma_lock(struct vfio_pci_core_device *vdev, bool try) +static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev) { - struct vfio_pci_mmap_vma *mmap_vma, *tmp; + struct vfio_device *core_vdev = &vdev->vdev; + loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX); + loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX); + loff_t len = end - start; - /* - * Lock ordering: - * vma_lock is nested under mmap_lock for vm_ops callback paths. - * The memory_lock semaphore is used by both code paths calling - * into this function to zap vmas and the vm_ops.fault callback - * to protect the memory enable state of the device. - * - * When zapping vmas we need to maintain the mmap_lock => vma_lock - * ordering, which requires using vma_lock to walk vma_list to - * acquire an mm, then dropping vma_lock to get the mmap_lock and - * reacquiring vma_lock. This logic is derived from similar - * requirements in uverbs_user_mmap_disassociate(). - * - * mmap_lock must always be the top-level lock when it is taken. - * Therefore we can only hold the memory_lock write lock when - * vma_list is empty, as we'd need to take mmap_lock to clear - * entries. 
vma_list can only be guaranteed empty when holding - * vma_lock, thus memory_lock is nested under vma_lock. - * - * This enables the vm_ops.fault callback to acquire vma_lock, - * followed by memory_lock read lock, while already holding - * mmap_lock without risk of deadlock. - */ - while (1) { - struct mm_struct *mm = NULL; - - if (try) { - if (!mutex_trylock(&vdev->vma_lock)) - return 0; - } else { - mutex_lock(&vdev->vma_lock); - } - while (!list_empty(&vdev->vma_list)) { - mmap_vma = list_first_entry(&vdev->vma_list, - struct vfio_pci_mmap_vma, - vma_next); - mm = mmap_vma->vma->vm_mm; - if (mmget_not_zero(mm)) - break; - - list_del(&mmap_vma->vma_next); - kfree(mmap_vma); - mm = NULL; - } - if (!mm) - return 1; - mutex_unlock(&vdev->vma_lock); - - if (try) { - if (!mmap_read_trylock(mm)) { - mmput(mm); - return 0; - } - } else { - mmap_read_lock(mm); - } - if (try) { - if (!mutex_trylock(&vdev->vma_lock)) { - mmap_read_unlock(mm); - mmput(mm); - return 0; - } - } else { - mutex_lock(&vdev->vma_lock); - } - list_for_each_entry_safe(mmap_vma, tmp, - &vdev->vma_list, vma_next) { - struct vm_area_struct *vma = mmap_vma->vma; - - if (vma->vm_mm != mm) - continue; - - list_del(&mmap_vma->vma_next); - kfree(mmap_vma); - - zap_vma_ptes(vma, vma->vm_start, - vma->vm_end - vma->vm_start); - } - mutex_unlock(&vdev->vma_lock); - mmap_read_unlock(mm); - mmput(mm); - } + unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true); } void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev) { - vfio_pci_zap_and_vma_lock(vdev, false); down_write(&vdev->memory_lock); - mutex_unlock(&vdev->vma_lock); + vfio_pci_zap_bars(vdev); } u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev) @@ -1725,99 +1645,48 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c up_write(&vdev->memory_lock); } -/* Caller holds vma_lock */ -static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev, - struct vm_area_struct 
*vma) +static int vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn) { - struct vfio_pci_mmap_vma *mmap_vma; - - mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT); - if (!mmap_vma) - return -ENOMEM; - - mmap_vma->vma = vma; - list_add(&mmap_vma->vma_next, &vdev->vma_list); + struct vfio_pci_core_device *vdev = vma->vm_private_data; + int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT); + u64 pgoff; - return 0; -} + if (index >= VFIO_PCI_ROM_REGION_INDEX || + !vdev->bar_mmap_supported[index] || !vdev->barmap[index]) + return -EINVAL; -/* - * Zap mmaps on open so that we can fault them in on access and therefore - * our vma_list only tracks mappings accessed since last zap. - */ -static void vfio_pci_mmap_open(struct vm_area_struct *vma) -{ - zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start); -} + pgoff = vma->vm_pgoff & + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); -static void vfio_pci_mmap_close(struct vm_area_struct *vma) -{ - struct vfio_pci_core_device *vdev = vma->vm_private_data; - struct vfio_pci_mmap_vma *mmap_vma; + *pfn = (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; - mutex_lock(&vdev->vma_lock); - list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) { - if (mmap_vma->vma == vma) { - list_del(&mmap_vma->vma_next); - kfree(mmap_vma); - break; - } - } - mutex_unlock(&vdev->vma_lock); + return 0; } static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct vfio_pci_core_device *vdev = vma->vm_private_data; - struct vfio_pci_mmap_vma *mmap_vma; - vm_fault_t ret = VM_FAULT_NOPAGE; + unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff; + vm_fault_t ret = VM_FAULT_SIGBUS; - mutex_lock(&vdev->vma_lock); - down_read(&vdev->memory_lock); - - /* - * Memory region cannot be accessed if the low power feature is engaged - * or memory access is disabled. 
- */ - if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) { - ret = VM_FAULT_SIGBUS; - goto up_out; - } + if (vma_to_pfn(vma, &pfn)) + return ret; - /* - * We populate the whole vma on fault, so we need to test whether - * the vma has already been mapped, such as for concurrent faults - * to the same vma. io_remap_pfn_range() will trigger a BUG_ON if - * we ask it to fill the same range again. - */ - list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) { - if (mmap_vma->vma == vma) - goto up_out; - } + down_read(&vdev->memory_lock); - if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, - vma->vm_end - vma->vm_start, - vma->vm_page_prot)) { - ret = VM_FAULT_SIGBUS; - zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start); - goto up_out; - } + if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) + goto out_disabled; - if (__vfio_pci_add_vma(vdev, vma)) { - ret = VM_FAULT_OOM; - zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start); - } + ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff); -up_out: +out_disabled: up_read(&vdev->memory_lock); - mutex_unlock(&vdev->vma_lock); + return ret; } static const struct vm_operations_struct vfio_pci_mmap_ops = { - .open = vfio_pci_mmap_open, - .close = vfio_pci_mmap_close, .fault = vfio_pci_mmap_fault, }; @@ -1880,11 +1749,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma vma->vm_private_data = vdev; vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); - vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff; + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); /* - * See remap_pfn_range(), called from vfio_pci_fault() but we can't - * change vm_flags within the fault handler. Set them now. + * Set vm_flags now, they should not be changed in the fault handler. + * We want the same flags and page protection (decrypted above) as + * io_remap_pfn_range() would set. 
* * VM_ALLOW_ANY_UNCACHED: The VMA flag is implemented for ARM64, * allowing KVM stage 2 device mapping attributes to use Normal-NC @@ -2202,8 +2072,6 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) mutex_init(&vdev->ioeventfds_lock); INIT_LIST_HEAD(&vdev->dummy_resources_list); INIT_LIST_HEAD(&vdev->ioeventfds_list); - mutex_init(&vdev->vma_lock); - INIT_LIST_HEAD(&vdev->vma_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); init_rwsem(&vdev->memory_lock); xa_init(&vdev->ctx); @@ -2219,7 +2087,6 @@ void vfio_pci_core_release_dev(struct vfio_device *core_vdev) mutex_destroy(&vdev->igate); mutex_destroy(&vdev->ioeventfds_lock); - mutex_destroy(&vdev->vma_lock); kfree(vdev->region); kfree(vdev->pm_save); } @@ -2506,17 +2373,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, struct vfio_pci_group_info *groups, struct iommufd_ctx *iommufd_ctx) { - struct vfio_pci_core_device *cur_mem; - struct vfio_pci_core_device *cur_vma; - struct vfio_pci_core_device *cur; + struct vfio_pci_core_device *vdev; struct pci_dev *pdev; - bool is_mem = true; int ret; mutex_lock(&dev_set->lock); - cur_mem = list_first_entry(&dev_set->device_list, - struct vfio_pci_core_device, - vdev.dev_set_list); pdev = vfio_pci_dev_set_resettable(dev_set); if (!pdev) { @@ -2533,7 +2394,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, if (ret) goto err_unlock; - list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) { + list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) { bool owned; /* @@ -2557,38 +2418,36 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * Otherwise, reset is not allowed. 
*/ if (iommufd_ctx) { - int devid = vfio_iommufd_get_dev_id(&cur_vma->vdev, + int devid = vfio_iommufd_get_dev_id(&vdev->vdev, iommufd_ctx); owned = (devid > 0 || devid == -ENOENT); } else { - owned = vfio_dev_in_groups(&cur_vma->vdev, groups); + owned = vfio_dev_in_groups(&vdev->vdev, groups); } if (!owned) { ret = -EINVAL; - goto err_undo; + break; } /* * Locking multiple devices is prone to deadlock, runaway and * unwind if we hit contention. */ - if (!vfio_pci_zap_and_vma_lock(cur_vma, true)) { + if (!down_write_trylock(&vdev->memory_lock)) { ret = -EBUSY; - goto err_undo; + break; } + + vfio_pci_zap_bars(vdev); } - cur_vma = NULL; - list_for_each_entry(cur_mem, &dev_set->device_list, vdev.dev_set_list) { - if (!down_write_trylock(&cur_mem->memory_lock)) { - ret = -EBUSY; - goto err_undo; - } - mutex_unlock(&cur_mem->vma_lock); + if (!list_entry_is_head(vdev, + &dev_set->device_list, vdev.dev_set_list)) { + vdev = list_prev_entry(vdev, vdev.dev_set_list); + goto err_undo; } - cur_mem = NULL; /* * The pci_reset_bus() will reset all the devices in the bus. @@ -2599,25 +2458,22 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * cause the PCI config space reset without restoring the original * state (saved locally in 'vdev->pm_save'). 
*/ - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) - vfio_pci_set_power_state(cur, PCI_D0); + list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) + vfio_pci_set_power_state(vdev, PCI_D0); ret = pci_reset_bus(pdev); + vdev = list_last_entry(&dev_set->device_list, + struct vfio_pci_core_device, vdev.dev_set_list); + err_undo: - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { - if (cur == cur_mem) - is_mem = false; - if (cur == cur_vma) - break; - if (is_mem) - up_write(&cur->memory_lock); - else - mutex_unlock(&cur->vma_lock); - } + list_for_each_entry_from_reverse(vdev, &dev_set->device_list, + vdev.dev_set_list) + up_write(&vdev->memory_lock); + + list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) + pm_runtime_put(&vdev->pdev->dev); - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) - pm_runtime_put(&cur->pdev->dev); err_unlock: mutex_unlock(&dev_set->lock); return ret; diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index a2c8b8bba711..f87067438ed4 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -93,8 +93,6 @@ struct vfio_pci_core_device { struct list_head sriov_pfs_item; struct vfio_pci_core_device *sriov_pf_core_dev; struct notifier_block nb; - struct mutex vma_lock; - struct list_head vma_list; struct rw_semaphore memory_lock; }; -- 2.45.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-23 19:56 ` [PATCH 2/2] vfio/pci: Use unmap_mapping_range() Alex Williamson @ 2024-05-24 0:39 ` Yan Zhao 2024-05-24 0:49 ` Peter Xu 2024-05-24 13:42 ` Jason Gunthorpe 2024-05-30 0:09 ` Tian, Kevin 2 siblings, 1 reply; 21+ messages in thread From: Yan Zhao @ 2024-05-24 0:39 UTC (permalink / raw) To: Alex Williamson; +Cc: kvm, ajones, kevin.tian, jgg, peterx On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > With the vfio device fd tied to the address space of the pseudo fs > inode, we can use the mm to track all vmas that might be mmap'ing > device BARs, which removes our vma_list and all the complicated lock > ordering necessary to manually zap each related vma. > > Note that we can no longer store the pfn in vm_pgoff if we want to use > unmap_mapping_range() to zap a selective portion of the device fd > corresponding to BAR mappings. > > This also converts our mmap fault handler to use vmf_insert_pfn() Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type for the PFN on x86 as what's done in io_remap_pfn_range(). Instead, it just calls lookup_memtype() and determine the final prot based on the result from this lookup, which might not prevent others from reserving the PFN to other memory types. Does that matter? > because we no longer have a vma_list to avoid the concurrency problem > with io_remap_pfn_range(). The goal is to eventually use the vm_ops > huge_fault handler to avoid the additional faulting overhead, but > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. > > Also, Jason notes that a race exists between unmap_mapping_range() and > the fops mmap callback if we were to call io_remap_pfn_range() to > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > before it does vma_link_file() which gives a window where the vma is > populated but invisible to unmap_mapping_range(). 
> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-24 0:39 ` Yan Zhao @ 2024-05-24 0:49 ` Peter Xu 2024-05-24 1:47 ` Yan Zhao 2024-05-24 8:40 ` Tian, Kevin 0 siblings, 2 replies; 21+ messages in thread From: Peter Xu @ 2024-05-24 0:49 UTC (permalink / raw) To: Yan Zhao; +Cc: Alex Williamson, kvm, ajones, kevin.tian, jgg Hi, Yan, On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > With the vfio device fd tied to the address space of the pseudo fs > > inode, we can use the mm to track all vmas that might be mmap'ing > > device BARs, which removes our vma_list and all the complicated lock > > ordering necessary to manually zap each related vma. > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > unmap_mapping_range() to zap a selective portion of the device fd > > corresponding to BAR mappings. > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > for the PFN on x86 as what's done in io_remap_pfn_range(). > > Instead, it just calls lookup_memtype() and determine the final prot based on > the result from this lookup, which might not prevent others from reserving the > PFN to other memory types. I didn't worry too much on others reserving the same pfn range, as that should be the mmio region for this device, and this device should be owned by vfio driver. However I share the same question, see: https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com So far I think it's not a major issue as VFIO always use UC- mem type, and that's also the default. But I do also feel like there's something we can do better, and I'll keep you copied too if I'll resend the series. Thanks, > > Does that matter? > > because we no longer have a vma_list to avoid the concurrency problem > > with io_remap_pfn_range(). 
The goal is to eventually use the vm_ops > > huge_fault handler to avoid the additional faulting overhead, but > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. > > > > Also, Jason notes that a race exists between unmap_mapping_range() and > > the fops mmap callback if we were to call io_remap_pfn_range() to > > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > > before it does vma_link_file() which gives a window where the vma is > > populated but invisible to unmap_mapping_range(). > > > -- Peter Xu ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-24 0:49 ` Peter Xu @ 2024-05-24 1:47 ` Yan Zhao 2024-05-28 18:42 ` Alex Williamson 2024-05-24 8:40 ` Tian, Kevin 1 sibling, 1 reply; 21+ messages in thread From: Yan Zhao @ 2024-05-24 1:47 UTC (permalink / raw) To: Peter Xu; +Cc: Alex Williamson, kvm, ajones, kevin.tian, jgg On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > Hi, Yan, > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > With the vfio device fd tied to the address space of the pseudo fs > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > device BARs, which removes our vma_list and all the complicated lock > > > ordering necessary to manually zap each related vma. > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > unmap_mapping_range() to zap a selective portion of the device fd > > > corresponding to BAR mappings. > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > the result from this lookup, which might not prevent others from reserving the > > PFN to other memory types. > > I didn't worry too much on others reserving the same pfn range, as that > should be the mmio region for this device, and this device should be owned > by vfio driver. > > However I share the same question, see: > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > So far I think it's not a major issue as VFIO always use UC- mem type, and > that's also the default. 
But I do also feel like there's something we can Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO (or the variant driver) opts to use WC for certain BAR as mem type in future. Not sure if it will be true though :) > do better, and I'll keep you copied too if I'll resend the series. Thanks. > > > > > > Does that matter? > > > because we no longer have a vma_list to avoid the concurrency problem > > > with io_remap_pfn_range(). The goal is to eventually use the vm_ops > > > huge_fault handler to avoid the additional faulting overhead, but > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. > > > > > > Also, Jason notes that a race exists between unmap_mapping_range() and > > > the fops mmap callback if we were to call io_remap_pfn_range() to > > > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > > > before it does vma_link_file() which gives a window where the vma is > > > populated but invisible to unmap_mapping_range(). > > > > > > > -- > Peter Xu > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-24 1:47 ` Yan Zhao @ 2024-05-28 18:42 ` Alex Williamson 2024-05-29 2:29 ` Yan Zhao 0 siblings, 1 reply; 21+ messages in thread From: Alex Williamson @ 2024-05-28 18:42 UTC (permalink / raw) To: Yan Zhao; +Cc: Peter Xu, kvm, ajones, kevin.tian, jgg On Fri, 24 May 2024 09:47:03 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > > Hi, Yan, > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > device BARs, which removes our vma_list and all the complicated lock > > > > ordering necessary to manually zap each related vma. > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > corresponding to BAR mappings. > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > > the result from this lookup, which might not prevent others from reserving the > > > PFN to other memory types. > > > > I didn't worry too much on others reserving the same pfn range, as that > > should be the mmio region for this device, and this device should be owned > > by vfio driver. > > > > However I share the same question, see: > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > that's also the default. 
But I do also feel like there's something we can > Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO > (or the variant driver) opts to use WC for certain BAR as mem type in future. > Not sure if it will be true though :) Does Kevin's comment[1] satisfy your concern? vfio_pci_core_mmap() needs to make sure the PCI BAR region is requested before the mmap, which is tracked via the barmap. Therefore the barmap is always setup via pci_iomap() which will call memtype_reserve() with UC- attribute. If there are any additional comments required to make this more clear or outline steps for WC support in the future, please provide suggestions. Thanks, Alex [1]https://lore.kernel.org/all/BN9PR11MB52764E958E6481A112649B5D8CF52@BN9PR11MB5276.namprd11.prod.outlook.com/ > > > Does that matter? > > > > because we no longer have a vma_list to avoid the concurrency problem > > > > with io_remap_pfn_range(). The goal is to eventually use the vm_ops > > > > huge_fault handler to avoid the additional faulting overhead, but > > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. > > > > > > > > Also, Jason notes that a race exists between unmap_mapping_range() and > > > > the fops mmap callback if we were to call io_remap_pfn_range() to > > > > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > > > > before it does vma_link_file() which gives a window where the vma is > > > > populated but invisible to unmap_mapping_range(). > > > > > > > > > > > -- > > Peter Xu > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-28 18:42 ` Alex Williamson @ 2024-05-29 2:29 ` Yan Zhao 2024-05-29 3:12 ` Alex Williamson 0 siblings, 1 reply; 21+ messages in thread From: Yan Zhao @ 2024-05-29 2:29 UTC (permalink / raw) To: Alex Williamson; +Cc: Peter Xu, kvm, ajones, kevin.tian, jgg On Tue, May 28, 2024 at 12:42:51PM -0600, Alex Williamson wrote: > On Fri, 24 May 2024 09:47:03 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > > > Hi, Yan, > > > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > > device BARs, which removes our vma_list and all the complicated lock > > > > > ordering necessary to manually zap each related vma. > > > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > > corresponding to BAR mappings. > > > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > > > the result from this lookup, which might not prevent others from reserving the > > > > PFN to other memory types. > > > > > > I didn't worry too much on others reserving the same pfn range, as that > > > should be the mmio region for this device, and this device should be owned > > > by vfio driver. 
> > > > > > However I share the same question, see: > > > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > > that's also the default. But I do also feel like there's something we can > > Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO > > (or the variant driver) opts to use WC for certain BAR as mem type in future. > > Not sure if it will be true though :) > > Does Kevin's comment[1] satisfy your concern? vfio_pci_core_mmap() > needs to make sure the PCI BAR region is requested before the mmap, > which is tracked via the barmap. Therefore the barmap is always setup > via pci_iomap() which will call memtype_reserve() with UC- attribute. Just a question out of curiosity. Is this a must to call pci_iomap() in vfio_pci_core_mmap()? I don't see it or ioremap*() is called before nvgrace_gpu_mmap(). > > If there are any additional comments required to make this more clear > or outline steps for WC support in the future, please provide > suggestions. Thanks, > > Alex > > [1]https://lore.kernel.org/all/BN9PR11MB52764E958E6481A112649B5D8CF52@BN9PR11MB5276.namprd11.prod.outlook.com/ > > > > > Does that matter? > > > > > because we no longer have a vma_list to avoid the concurrency problem > > > > > with io_remap_pfn_range(). The goal is to eventually use the vm_ops > > > > > huge_fault handler to avoid the additional faulting overhead, but > > > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. > > > > > > > > > > Also, Jason notes that a race exists between unmap_mapping_range() and > > > > > the fops mmap callback if we were to call io_remap_pfn_range() to > > > > > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > > > > > before it does vma_link_file() which gives a window where the vma is > > > > > populated but invisible to unmap_mapping_range(). 
> > > > > > > > > > > > > > > -- > > > Peter Xu > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-29 2:29 ` Yan Zhao @ 2024-05-29 3:12 ` Alex Williamson 2024-05-29 6:34 ` Yan Zhao 0 siblings, 1 reply; 21+ messages in thread From: Alex Williamson @ 2024-05-29 3:12 UTC (permalink / raw) To: Yan Zhao; +Cc: Peter Xu, kvm, ajones, kevin.tian, jgg On Wed, 29 May 2024 10:29:33 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > On Tue, May 28, 2024 at 12:42:51PM -0600, Alex Williamson wrote: > > On Fri, 24 May 2024 09:47:03 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > > > > Hi, Yan, > > > > > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > > > device BARs, which removes our vma_list and all the complicated lock > > > > > > ordering necessary to manually zap each related vma. > > > > > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > > > corresponding to BAR mappings. > > > > > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > > > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > > > > the result from this lookup, which might not prevent others from reserving the > > > > > PFN to other memory types. > > > > > > > > I didn't worry too much on others reserving the same pfn range, as that > > > > should be the mmio region for this device, and this device should be owned > > > > by vfio driver. 
> > > > However I share the same question, see:
> > > >
> > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com
> > > >
> > > > So far I think it's not a major issue as VFIO always use UC- mem type, and
> > > > that's also the default. But I do also feel like there's something we can
> > > Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO
> > > (or the variant driver) opts to use WC for certain BAR as mem type in future.
> > > Not sure if it will be true though :)
> >
> > Does Kevin's comment[1] satisfy your concern?  vfio_pci_core_mmap()
> > needs to make sure the PCI BAR region is requested before the mmap,
> > which is tracked via the barmap.  Therefore the barmap is always setup
> > via pci_iomap() which will call memtype_reserve() with UC- attribute.
> Just a question out of curiosity.
> Is this a must to call pci_iomap() in vfio_pci_core_mmap()?
> I don't see it or ioremap*() is called before nvgrace_gpu_mmap().

nvgrace-gpu is exposing a non-PCI coherent memory region as a BAR, so
it doesn't request the PCI BAR region and is on its own for read/write
access as well.  To mmap an actual PCI BAR it's required to request the
region and vfio-pci-core uses the barmap to track which BARs have been
requested.  Thanks,

Alex

> > If there are any additional comments required to make this more clear
> > or outline steps for WC support in the future, please provide
> > suggestions.  Thanks,
> >
> > Alex
> >
> > [1]https://lore.kernel.org/all/BN9PR11MB52764E958E6481A112649B5D8CF52@BN9PR11MB5276.namprd11.prod.outlook.com/
> >
> > > > > Does that matter?
> > > > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > > > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > > > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > > > > > > > > > > Also, Jason notes that a race exists between unmap_mapping_range() and > > > > > > the fops mmap callback if we were to call io_remap_pfn_range() to > > > > > > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > > > > > > before it does vma_link_file() which gives a window where the vma is > > > > > > populated but invisible to unmap_mapping_range(). > > > > > > > > > > > > > > > > > > > -- > > > > Peter Xu > > > > > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-29 3:12 ` Alex Williamson @ 2024-05-29 6:34 ` Yan Zhao 2024-05-29 16:50 ` Alex Williamson 0 siblings, 1 reply; 21+ messages in thread From: Yan Zhao @ 2024-05-29 6:34 UTC (permalink / raw) To: Alex Williamson; +Cc: Peter Xu, kvm, ajones, kevin.tian, jgg On Tue, May 28, 2024 at 09:12:00PM -0600, Alex Williamson wrote: > On Wed, 29 May 2024 10:29:33 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > On Tue, May 28, 2024 at 12:42:51PM -0600, Alex Williamson wrote: > > > On Fri, 24 May 2024 09:47:03 +0800 > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > > > > > Hi, Yan, > > > > > > > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > > > > device BARs, which removes our vma_list and all the complicated lock > > > > > > > ordering necessary to manually zap each related vma. > > > > > > > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > > > > corresponding to BAR mappings. > > > > > > > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > > > > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > > > > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > > > > > the result from this lookup, which might not prevent others from reserving the > > > > > > PFN to other memory types. 
> > > > > > > > > > I didn't worry too much on others reserving the same pfn range, as that > > > > > should be the mmio region for this device, and this device should be owned > > > > > by vfio driver. > > > > > > > > > > However I share the same question, see: > > > > > > > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > > > > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > > > > that's also the default. But I do also feel like there's something we can > > > > Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO > > > > (or the variant driver) opts to use WC for certain BAR as mem type in future. > > > > Not sure if it will be true though :) > > > > > > Does Kevin's comment[1] satisfy your concern? vfio_pci_core_mmap() > > > needs to make sure the PCI BAR region is requested before the mmap, > > > which is tracked via the barmap. Therefore the barmap is always setup > > > via pci_iomap() which will call memtype_reserve() with UC- attribute. > > Just a question out of curiosity. > > Is this a must to call pci_iomap() in vfio_pci_core_mmap()? > > I don't see it or ioremap*() is called before nvgrace_gpu_mmap(). > > nvgrace-gpu is exposing a non-PCI coherent memory region as a BAR, so > it doesn't request the PCI BAR region and is on it's own for read/write > access as well. To mmap an actual PCI BAR it's required to request the Thanks for explanation! So, if mmap happens before read/write, where is page memtype reserved? > region and vfio-pci-core uses the barmap to track which BARs have been > requested. Thanks, ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-29 6:34 ` Yan Zhao @ 2024-05-29 16:50 ` Alex Williamson 2024-05-30 7:46 ` Yan Zhao 0 siblings, 1 reply; 21+ messages in thread From: Alex Williamson @ 2024-05-29 16:50 UTC (permalink / raw) To: Yan Zhao; +Cc: Peter Xu, kvm, ajones, kevin.tian, jgg On Wed, 29 May 2024 14:34:23 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > On Tue, May 28, 2024 at 09:12:00PM -0600, Alex Williamson wrote: > > On Wed, 29 May 2024 10:29:33 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > On Tue, May 28, 2024 at 12:42:51PM -0600, Alex Williamson wrote: > > > > On Fri, 24 May 2024 09:47:03 +0800 > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > > > On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > > > > > > Hi, Yan, > > > > > > > > > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > > > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > > > > > device BARs, which removes our vma_list and all the complicated lock > > > > > > > > ordering necessary to manually zap each related vma. > > > > > > > > > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > > > > > corresponding to BAR mappings. > > > > > > > > > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > > > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > > > > > > for the PFN on x86 as what's done in io_remap_pfn_range(). 
> > > > > > > > > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > > > > > > the result from this lookup, which might not prevent others from reserving the > > > > > > > PFN to other memory types. > > > > > > > > > > > > I didn't worry too much on others reserving the same pfn range, as that > > > > > > should be the mmio region for this device, and this device should be owned > > > > > > by vfio driver. > > > > > > > > > > > > However I share the same question, see: > > > > > > > > > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > > > > > > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > > > > > that's also the default. But I do also feel like there's something we can > > > > > Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO > > > > > (or the variant driver) opts to use WC for certain BAR as mem type in future. > > > > > Not sure if it will be true though :) > > > > > > > > Does Kevin's comment[1] satisfy your concern? vfio_pci_core_mmap() > > > > needs to make sure the PCI BAR region is requested before the mmap, > > > > which is tracked via the barmap. Therefore the barmap is always setup > > > > via pci_iomap() which will call memtype_reserve() with UC- attribute. > > > Just a question out of curiosity. > > > Is this a must to call pci_iomap() in vfio_pci_core_mmap()? > > > I don't see it or ioremap*() is called before nvgrace_gpu_mmap(). > > > > nvgrace-gpu is exposing a non-PCI coherent memory region as a BAR, so > > it doesn't request the PCI BAR region and is on it's own for read/write > > access as well. To mmap an actual PCI BAR it's required to request the > Thanks for explanation! > So, if mmap happens before read/write, where is page memtype reserved? For nvgrace-gpu? The device for this variant driver only exists on ARM platforms. memtype_reserve() only exists on x86. 
Thanks, Alex ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-29 16:50 ` Alex Williamson @ 2024-05-30 7:46 ` Yan Zhao 0 siblings, 0 replies; 21+ messages in thread From: Yan Zhao @ 2024-05-30 7:46 UTC (permalink / raw) To: Alex Williamson; +Cc: Peter Xu, kvm, ajones, kevin.tian, jgg On Wed, May 29, 2024 at 10:50:12AM -0600, Alex Williamson wrote: > On Wed, 29 May 2024 14:34:23 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > On Tue, May 28, 2024 at 09:12:00PM -0600, Alex Williamson wrote: > > > On Wed, 29 May 2024 10:29:33 +0800 > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > On Tue, May 28, 2024 at 12:42:51PM -0600, Alex Williamson wrote: > > > > > On Fri, 24 May 2024 09:47:03 +0800 > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > > > > > On Thu, May 23, 2024 at 08:49:03PM -0400, Peter Xu wrote: > > > > > > > Hi, Yan, > > > > > > > > > > > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > > > > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > > > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > > > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > > > > > > device BARs, which removes our vma_list and all the complicated lock > > > > > > > > > ordering necessary to manually zap each related vma. > > > > > > > > > > > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > > > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > > > > > > corresponding to BAR mappings. > > > > > > > > > > > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > > > > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve memory type > > > > > > > > for the PFN on x86 as what's done in io_remap_pfn_range(). 
> > > > > > > > > > > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based on > > > > > > > > the result from this lookup, which might not prevent others from reserving the > > > > > > > > PFN to other memory types. > > > > > > > > > > > > > > I didn't worry too much on others reserving the same pfn range, as that > > > > > > > should be the mmio region for this device, and this device should be owned > > > > > > > by vfio driver. > > > > > > > > > > > > > > However I share the same question, see: > > > > > > > > > > > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > > > > > > > > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > > > > > > that's also the default. But I do also feel like there's something we can > > > > > > Right, but I feel that it may lead to inconsistency in reserved mem type if VFIO > > > > > > (or the variant driver) opts to use WC for certain BAR as mem type in future. > > > > > > Not sure if it will be true though :) > > > > > > > > > > Does Kevin's comment[1] satisfy your concern? vfio_pci_core_mmap() > > > > > needs to make sure the PCI BAR region is requested before the mmap, > > > > > which is tracked via the barmap. Therefore the barmap is always setup > > > > > via pci_iomap() which will call memtype_reserve() with UC- attribute. > > > > Just a question out of curiosity. > > > > Is this a must to call pci_iomap() in vfio_pci_core_mmap()? > > > > I don't see it or ioremap*() is called before nvgrace_gpu_mmap(). > > > > > > nvgrace-gpu is exposing a non-PCI coherent memory region as a BAR, so > > > it doesn't request the PCI BAR region and is on it's own for read/write > > > access as well. To mmap an actual PCI BAR it's required to request the > > Thanks for explanation! > > So, if mmap happens before read/write, where is page memtype reserved? > > For nvgrace-gpu? The device for this variant driver only exists on ARM > platforms. 
> memtype_reserve() only exists on x86.  Thanks,
>

Oh. Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread
* RE: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-24 0:49 ` Peter Xu 2024-05-24 1:47 ` Yan Zhao @ 2024-05-24 8:40 ` Tian, Kevin 2024-05-24 13:22 ` Jason Gunthorpe 1 sibling, 1 reply; 21+ messages in thread From: Tian, Kevin @ 2024-05-24 8:40 UTC (permalink / raw) To: Peter Xu, Zhao, Yan Y Cc: Alex Williamson, kvm@vger.kernel.org, ajones@ventanamicro.com, jgg@nvidia.com > From: Peter Xu <peterx@redhat.com> > Sent: Friday, May 24, 2024 8:49 AM > > Hi, Yan, > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > With the vfio device fd tied to the address space of the pseudo fs > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > device BARs, which removes our vma_list and all the complicated lock > > > ordering necessary to manually zap each related vma. > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > unmap_mapping_range() to zap a selective portion of the device fd > > > corresponding to BAR mappings. > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve > memory type > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > Instead, it just calls lookup_memtype() and determine the final prot based > on > > the result from this lookup, which might not prevent others from reserving > the > > PFN to other memory types. > > I didn't worry too much on others reserving the same pfn range, as that > should be the mmio region for this device, and this device should be owned > by vfio driver. 
and the earliest point doing memtype_reserve() is here:

  vfio_pci_core_mmap()
    vdev->barmap[index] = pci_iomap(pdev, index, 0);

>
> However I share the same question, see:
>
> https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com
>
> So far I think it's not a major issue as VFIO always use UC- mem type, and
> that's also the default. But I do also feel like there's something we can
> do better, and I'll keep you copied too if I'll resend the series.
>

vfio-nvgrace uses WC. But it directly does remap_pfn_range() in its
nvgrace_gpu_mmap() so not suffering from the issue here.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-24 8:40 ` Tian, Kevin @ 2024-05-24 13:22 ` Jason Gunthorpe 2024-05-24 23:15 ` Peter Xu 0 siblings, 1 reply; 21+ messages in thread From: Jason Gunthorpe @ 2024-05-24 13:22 UTC (permalink / raw) To: Tian, Kevin Cc: Peter Xu, Zhao, Yan Y, Alex Williamson, kvm@vger.kernel.org, ajones@ventanamicro.com On Fri, May 24, 2024 at 08:40:26AM +0000, Tian, Kevin wrote: > > From: Peter Xu <peterx@redhat.com> > > Sent: Friday, May 24, 2024 8:49 AM > > > > Hi, Yan, > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > device BARs, which removes our vma_list and all the complicated lock > > > > ordering necessary to manually zap each related vma. > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > corresponding to BAR mappings. > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve > > memory type > > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based > > on > > > the result from this lookup, which might not prevent others from reserving > > the > > > PFN to other memory types. > > > > I didn't worry too much on others reserving the same pfn range, as that > > should be the mmio region for this device, and this device should be owned > > by vfio driver. 
> > and the earliest point doing memtype_reserve() is here: > > vfio_pci_core_mmap() > vdev->barmap[index] = pci_iomap(pdev, index, 0); > > > > > However I share the same question, see: > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > that's also the default. But I do also feel like there's something we can > > do better, and I'll keep you copied too if I'll resend the series. > > > > vfio-nvgrace uses WC. But it directly does remap_pfn_range() in its > nvgrace_gpu_mmap() so not suffering from the issue here. People keep asking for WC on normal VFIO PCI as well, we shouldn't rule out, or at least provide a big warning comment what needs to be fixed to allow it. Jason ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-24 13:22 ` Jason Gunthorpe @ 2024-05-24 23:15 ` Peter Xu 0 siblings, 0 replies; 21+ messages in thread From: Peter Xu @ 2024-05-24 23:15 UTC (permalink / raw) To: Jason Gunthorpe Cc: Tian, Kevin, Zhao, Yan Y, Alex Williamson, kvm@vger.kernel.org, ajones@ventanamicro.com On Fri, May 24, 2024 at 10:22:40AM -0300, Jason Gunthorpe wrote: > On Fri, May 24, 2024 at 08:40:26AM +0000, Tian, Kevin wrote: > > > From: Peter Xu <peterx@redhat.com> > > > Sent: Friday, May 24, 2024 8:49 AM > > > > > > Hi, Yan, > > > > > > On Fri, May 24, 2024 at 08:39:37AM +0800, Yan Zhao wrote: > > > > On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > > > > > With the vfio device fd tied to the address space of the pseudo fs > > > > > inode, we can use the mm to track all vmas that might be mmap'ing > > > > > device BARs, which removes our vma_list and all the complicated lock > > > > > ordering necessary to manually zap each related vma. > > > > > > > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use > > > > > unmap_mapping_range() to zap a selective portion of the device fd > > > > > corresponding to BAR mappings. > > > > > > > > > > This also converts our mmap fault handler to use vmf_insert_pfn() > > > > Looks vmf_insert_pfn() does not call memtype_reserve() to reserve > > > memory type > > > > for the PFN on x86 as what's done in io_remap_pfn_range(). > > > > > > > > Instead, it just calls lookup_memtype() and determine the final prot based > > > on > > > > the result from this lookup, which might not prevent others from reserving > > > the > > > > PFN to other memory types. > > > > > > I didn't worry too much on others reserving the same pfn range, as that > > > should be the mmio region for this device, and this device should be owned > > > by vfio driver. 
> > > > and the earliest point doing memtype_reserve() is here: > > > > vfio_pci_core_mmap() > > vdev->barmap[index] = pci_iomap(pdev, index, 0); > > > > > > > > However I share the same question, see: > > > > > > https://lore.kernel.org/r/20240523223745.395337-2-peterx@redhat.com > > > > > > So far I think it's not a major issue as VFIO always use UC- mem type, and > > > that's also the default. But I do also feel like there's something we can > > > do better, and I'll keep you copied too if I'll resend the series. > > > > > > > vfio-nvgrace uses WC. But it directly does remap_pfn_range() in its > > nvgrace_gpu_mmap() so not suffering from the issue here. > > People keep asking for WC on normal VFIO PCI as well, we shouldn't > rule out, or at least provide a big warning comment what needs to be > fixed to allow it. Maybe we can have a comment indeed. Or as long as that pat series can get merged before adding WC support we should also be good, and that's also the hope.. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range() 2024-05-23 19:56 ` [PATCH 2/2] vfio/pci: Use unmap_mapping_range() Alex Williamson 2024-05-24 0:39 ` Yan Zhao @ 2024-05-24 13:42 ` Jason Gunthorpe 2024-05-30 0:09 ` Tian, Kevin 2 siblings, 0 replies; 21+ messages in thread From: Jason Gunthorpe @ 2024-05-24 13:42 UTC (permalink / raw) To: Alex Williamson; +Cc: kvm, ajones, yan.y.zhao, kevin.tian, peterx On Thu, May 23, 2024 at 01:56:27PM -0600, Alex Williamson wrote: > With the vfio device fd tied to the address space of the pseudo fs > inode, we can use the mm to track all vmas that might be mmap'ing > device BARs, which removes our vma_list and all the complicated lock > ordering necessary to manually zap each related vma. > > Note that we can no longer store the pfn in vm_pgoff if we want to use > unmap_mapping_range() to zap a selective portion of the device fd > corresponding to BAR mappings. > > This also converts our mmap fault handler to use vmf_insert_pfn() > because we no longer have a vma_list to avoid the concurrency problem > with io_remap_pfn_range(). The goal is to eventually use the vm_ops > huge_fault handler to avoid the additional faulting overhead, but > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. > > Also, Jason notes that a race exists between unmap_mapping_range() and > the fops mmap callback if we were to call io_remap_pfn_range() to > populate the vma on mmap. Specifically, mmap_region() does call_mmap() > before it does vma_link_file() which gives a window where the vma is > populated but invisible to unmap_mapping_range(). > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 256 +++++++------------------------ > include/linux/vfio_pci_core.h | 2 - > 2 files changed, 56 insertions(+), 202 deletions(-) Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [PATCH 2/2] vfio/pci: Use unmap_mapping_range()
  2024-05-23 19:56 ` [PATCH 2/2] vfio/pci: Use unmap_mapping_range() Alex Williamson
  2024-05-24  0:39   ` Yan Zhao
  2024-05-24 13:42   ` Jason Gunthorpe
@ 2024-05-30  0:09   ` Tian, Kevin
  2024-05-30  2:22     ` Alex Williamson
  2 siblings, 1 reply; 21+ messages in thread
From: Tian, Kevin @ 2024-05-30  0:09 UTC (permalink / raw)
To: Alex Williamson, kvm@vger.kernel.org
Cc: ajones@ventanamicro.com, Zhao, Yan Y, jgg@nvidia.com, peterx@redhat.com

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, May 24, 2024 3:56 AM
>
> -/* Caller holds vma_lock */
> -static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
> -			      struct vm_area_struct *vma)
> +static int vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn)
>  {
> -	struct vfio_pci_mmap_vma *mmap_vma;
> -
> -	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
> -	if (!mmap_vma)
> -		return -ENOMEM;
> -
> -	mmap_vma->vma = vma;
> -	list_add(&mmap_vma->vma_next, &vdev->vma_list);
> +	struct vfio_pci_core_device *vdev = vma->vm_private_data;
> +	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +	u64 pgoff;
>
> -	return 0;
> -}
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX ||
> +	    !vdev->bar_mmap_supported[index] || !vdev->barmap[index])
> +		return -EINVAL;

Is a WARN_ON() required here? If those checks fail vfio_pci_core_mmap()
will return error w/o installing vm_ops.

> @@ -2506,17 +2373,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> 				      struct vfio_pci_group_info *groups,
> 				      struct iommufd_ctx *iommufd_ctx)

the comment before this function should be updated too:

/*
 * We need to get memory_lock for each device, but devices can share mmap_lock,
 * therefore we need to zap and hold the vma_lock for each device, and only then
 * get each memory_lock.
 */

otherwise looks good:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] vfio/pci: Use unmap_mapping_range()
  2024-05-30  0:09   ` Tian, Kevin
@ 2024-05-30  2:22     ` Alex Williamson
  2024-05-30  2:47       ` Tian, Kevin
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2024-05-30 2:22 UTC (permalink / raw)
To: Tian, Kevin
Cc: kvm@vger.kernel.org, ajones@ventanamicro.com, Zhao, Yan Y,
	jgg@nvidia.com, peterx@redhat.com

On Thu, 30 May 2024 00:09:49 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, May 24, 2024 3:56 AM
> >
> > -/* Caller holds vma_lock */
> > -static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
> > -			      struct vm_area_struct *vma)
> > +static int vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn)
> >  {
> > -	struct vfio_pci_mmap_vma *mmap_vma;
> > -
> > -	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
> > -	if (!mmap_vma)
> > -		return -ENOMEM;
> > -
> > -	mmap_vma->vma = vma;
> > -	list_add(&mmap_vma->vma_next, &vdev->vma_list);
> > +	struct vfio_pci_core_device *vdev = vma->vm_private_data;
> > +	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> > +	u64 pgoff;
> >
> > -	return 0;
> > -}
> > +	if (index >= VFIO_PCI_ROM_REGION_INDEX ||
> > +	    !vdev->bar_mmap_supported[index] || !vdev->barmap[index])
> > +		return -EINVAL;
>
> Is a WARN_ON() required here? If those checks fail vfio_pci_core_mmap()
> will return error w/o installing vm_ops.

I think these tests largely come from previous iterations of the patch
where this function had more versatility, because yes, they do exactly
duplicate tests that we would have already passed before we established
this function in the vm_ops.fault path.

We could therefore wrap this in a WARN_ON, but actually with the
current usage it's really just a sanity test that vma->vm_pgoff hasn't
changed.  We don't change barmap or bar_mmap_supported while the device
is opened.  Is it all too much paranoia and we should remove the test
entirely and have this function return void?

> > @@ -2506,17 +2373,11 @@ static int vfio_pci_dev_set_hot_reset(struct
> > vfio_device_set *dev_set,
> > 				      struct vfio_pci_group_info *groups,
> > 				      struct iommufd_ctx *iommufd_ctx)
>
> the comment before this function should be updated too:
>
> /*
>  * We need to get memory_lock for each device, but devices can share mmap_lock,
>  * therefore we need to zap and hold the vma_lock for each device, and only then
>  * get each memory_lock.
>  */

Good catch.  I think I'd just delete this comment altogether and expand
the existing comment in the loop body as:

	/*
	 * Take the memory write lock for each device and zap BAR
	 * mappings to prevent the user accessing the device while in
	 * reset.  Locking multiple devices is prone to deadlock,
	 * runaway and unwind if we hit contention.
	 */
	if (!down_write_trylock(&vdev->memory_lock)) {
		ret = -EBUSY;
		break;
	}

Sound good?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread
* RE: [PATCH 2/2] vfio/pci: Use unmap_mapping_range()
  2024-05-30  2:22     ` Alex Williamson
@ 2024-05-30  2:47       ` Tian, Kevin
  0 siblings, 0 replies; 21+ messages in thread
From: Tian, Kevin @ 2024-05-30 2:47 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm@vger.kernel.org, ajones@ventanamicro.com, Zhao, Yan Y,
	jgg@nvidia.com, peterx@redhat.com

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, May 30, 2024 10:22 AM
>
> On Thu, 30 May 2024 00:09:49 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, May 24, 2024 3:56 AM
> > >
> > > -/* Caller holds vma_lock */
> > > -static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
> > > -			      struct vm_area_struct *vma)
> > > +static int vma_to_pfn(struct vm_area_struct *vma, unsigned long *pfn)
> > >  {
> > > -	struct vfio_pci_mmap_vma *mmap_vma;
> > > -
> > > -	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
> > > -	if (!mmap_vma)
> > > -		return -ENOMEM;
> > > -
> > > -	mmap_vma->vma = vma;
> > > -	list_add(&mmap_vma->vma_next, &vdev->vma_list);
> > > +	struct vfio_pci_core_device *vdev = vma->vm_private_data;
> > > +	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> > > +	u64 pgoff;
> > >
> > > -	return 0;
> > > -}
> > > +	if (index >= VFIO_PCI_ROM_REGION_INDEX ||
> > > +	    !vdev->bar_mmap_supported[index] || !vdev->barmap[index])
> > > +		return -EINVAL;
> >
> > Is a WARN_ON() required here? If those checks fail vfio_pci_core_mmap()
> > will return error w/o installing vm_ops.
>
> I think these tests largely come from previous iterations of the patch
> where this function had more versatility, because yes, they do exactly
> duplicate tests that we would have already passed before we established
> this function in the vm_ops.fault path.
>
> We could therefore wrap this in a WARN_ON, but actually with the
> current usage it's really just a sanity test that vma->vm_pgoff hasn't
> changed.  We don't change barmap or bar_mmap_supported while the device
> is opened.  Is it all too much paranoia and we should remove the test
> entirely and have this function return void?

yes this sounds reasonable.

> > > @@ -2506,17 +2373,11 @@ static int vfio_pci_dev_set_hot_reset(struct
> > > vfio_device_set *dev_set,
> > > 				      struct vfio_pci_group_info *groups,
> > > 				      struct iommufd_ctx *iommufd_ctx)
> >
> > the comment before this function should be updated too:
> >
> > /*
> >  * We need to get memory_lock for each device, but devices can share mmap_lock,
> >  * therefore we need to zap and hold the vma_lock for each device, and only then
> >  * get each memory_lock.
> >  */
>
> Good catch.  I think I'd just delete this comment altogether and expand
> the existing comment in the loop body as:
>
> 	/*
> 	 * Take the memory write lock for each device and zap BAR
> 	 * mappings to prevent the user accessing the device while in
> 	 * reset.  Locking multiple devices is prone to deadlock,
> 	 * runaway and unwind if we hit contention.
> 	 */
> 	if (!down_write_trylock(&vdev->memory_lock)) {
> 		ret = -EBUSY;
> 		break;
> 	}
>
> Sound good?  Thanks,

yes

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2024-05-30  7:47 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-23 19:56 [PATCH 0/2] vfio/pci: vfio device address space mapping Alex Williamson
2024-05-23 19:56 ` [PATCH 1/2] vfio: Create vfio_fs_type with inode per device Alex Williamson
2024-05-24 13:24   ` Jason Gunthorpe
2024-05-29 23:59   ` Tian, Kevin
2024-05-23 19:56 ` [PATCH 2/2] vfio/pci: Use unmap_mapping_range() Alex Williamson
2024-05-24  0:39   ` Yan Zhao
2024-05-24  0:49     ` Peter Xu
2024-05-24  1:47       ` Yan Zhao
2024-05-28 18:42     ` Alex Williamson
2024-05-29  2:29       ` Yan Zhao
2024-05-29  3:12         ` Alex Williamson
2024-05-29  6:34           ` Yan Zhao
2024-05-29 16:50             ` Alex Williamson
2024-05-30  7:46               ` Yan Zhao
2024-05-24  8:40   ` Tian, Kevin
2024-05-24 13:22     ` Jason Gunthorpe
2024-05-24 23:15       ` Peter Xu
2024-05-24 13:42   ` Jason Gunthorpe
2024-05-30  0:09   ` Tian, Kevin
2024-05-30  2:22     ` Alex Williamson
2024-05-30  2:47       ` Tian, Kevin