* VFIO MSIX Query
@ 2016-05-03 18:42 Saxena, Nitin
0 siblings, 0 replies; 6+ messages in thread
From: Saxena, Nitin @ 2016-05-03 18:42 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
Hi,
I am a newbie to the VFIO framework and am trying to use it for MSI-X
interrupt handling in my userspace application, which is similar to
Intel's DPDK.
My query is: why does the vfio kernel code not support MSI-X
masking/unmasking, i.e. why is VFIO_SET_ACTION_TRIGGER not implemented
in the kernel vfio code for MSI-X while it is present for legacy INTx?
Another question: since MSI-X interrupt masking is absent in the
kernel, who is supposed to mask/unmask the MSI-X interrupt? Does
userspace need to mask/unmask the interrupt after being notified via
the eventfd? Is there any use case where MSI-X handling is done via
the VFIO framework?
Thanks,
Nitin
* Re: VFIO MSIX Query
[not found] <CAB995Mu0==PJ0iU4h_3oAMGo-VYDFrjOg_BO757M=PDWgMWtCw@mail.gmail.com>
@ 2016-05-03 20:02 ` Alex Williamson
2016-05-04 2:12 ` Saxena, Nitin
[not found] ` <CAB995MsFjv=GvHSmo029fmuTWhGiYev3-2A358orFz_dq724Fw@mail.gmail.com>
0 siblings, 2 replies; 6+ messages in thread
From: Alex Williamson @ 2016-05-03 20:02 UTC (permalink / raw)
To: Nitin Saxena
Cc: linux-kernel, linux-arm-kernel@lists.infradead.org, Saxena, Nitin
Hi,
On Tue, 3 May 2016 23:56:38 +0530
Nitin Saxena <nitin.lnx@gmail.com> wrote:
> Hi,
>
> I am a newbie to VFIO framework and trying to use it for MSIX interrupt
> handling in my userspace application. My userspace application is like
> intel's dpdk.
>
> My query is why vfio kernel code does not support msix masking/ unmasking
> i.e VFIO_SET_ACTION_TRIGGER is not implemented in kernel vfio code for MSIX
> but its present for legacy INTX?
I think you mean VFIO_IRQ_SET_ACTION_MASK and
VFIO_IRQ_SET_ACTION_UNMASK are not implemented for MSI/X,
VFIO_IRQ_SET_ACTION_TRIGGER is the only one that is implemented. It
mostly comes down to it hasn't been needed. We would just need to
verify that disable_irq() and enable_irq() do the right thing and
enable the features through the existing VFIO API. The reason we have
these for INTx is that INTx is level triggered and we must mask the
interrupt on the host while it's serviced by the user because the
interrupt will continue to assert and we have no service guarantees
from the user. MSI/X is of course edge triggered, so it doesn't have
this problem; the device would need to re-assert the interrupt for it
to fire again. In cases like a QEMU user, we might want to continue to
receive guest masked interrupts to QEMU for emulating the pending bits
array, so even if we had VFIO masking support for these interrupts, QEMU
probably would not use it. If MSI/X masking is useful for you vs simply
ignoring the eventfd, patches welcome.
> Another question since MSIX interrupt masking is absent in kernel who is
> supposed to mask/unmask msix interrupt? does userspace needs to mask/unmask
> interrupt after getting notified by eventfd. Is there any usecase where
> MSIX handling is done via VFIO framework?
MSI/X is edge triggered and VFIO does not set the AUTOMASKED flag on
them, so there is simply no masking of them at all aside from fully
disabling them. QEMU has full support for MSI/X through VFIO that you
can use for reference. Thanks,
Alex
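
A minimal sketch of the trigger-only model described above: wire one
MSI-X vector to an eventfd with VFIO_DEVICE_SET_IRQS and then read (or
simply ignore) that eventfd. The device fd is assumed to come from
VFIO_GROUP_GET_DEVICE_FD; the vector index and error handling are
illustrative only, not taken from this thread.

/* Sketch: attach an eventfd to MSI-X vector 0 via VFIO_DEVICE_SET_IRQS. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int msix_vector0_eventfd(int device_fd)
{
	struct vfio_irq_set *irq_set;
	size_t argsz = sizeof(*irq_set) + sizeof(int32_t);
	int32_t efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	irq_set = calloc(1, argsz);
	if (!irq_set) {
		close(efd);
		return -1;
	}
	irq_set->argsz = argsz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;  /* MSI-X interrupt index */
	irq_set->start = 0;                        /* first vector */
	irq_set->count = 1;                        /* one eventfd */
	memcpy(irq_set->data, &efd, sizeof(efd));

	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set) < 0) {
		free(irq_set);
		close(efd);
		return -1;
	}
	free(irq_set);
	return efd;  /* a blocking read() here returns the interrupt count */
}

Since MSI-X is edge triggered and not AUTOMASKED, there is nothing to
unmask after the read; a user that wants to "mask" a vector can simply
stop servicing its eventfd, as noted above.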
* Re: VFIO MSIX Query
2016-05-03 20:02 ` VFIO MSIX Query Alex Williamson
@ 2016-05-04 2:12 ` Saxena, Nitin
[not found] ` <CAB995MsFjv=GvHSmo029fmuTWhGiYev3-2A358orFz_dq724Fw@mail.gmail.com>
1 sibling, 0 replies; 6+ messages in thread
From: Saxena, Nitin @ 2016-05-04 2:12 UTC (permalink / raw)
To: Alex Williamson
Cc: Nitin Saxena, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
Got it. Thanks Alex. I may have future queries as well on this. Will get back to you.
Thanks,
Nitin
> On 04-May-2016, at 01:33, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> Hi,
>
> On Tue, 3 May 2016 23:56:38 +0530
> Nitin Saxena <nitin.lnx@gmail.com> wrote:
>
>> Hi,
>>
>> I am a newbie to VFIO framework and trying to use it for MSIX interrupt
>> handling in my userspace application. My userspace application is like
>> intel's dpdk.
>>
>> My query is why vfio kernel code does not support msix masking/ unmasking
>> i.e VFIO_SET_ACTION_TRIGGER is not implemented in kernel vfio code for MSIX
>> but its present for legacy INTX?
>
> I think you mean VFIO_IRQ_SET_ACTION_MASK and
> VFIO_IRQ_SET_ACTION_UNMASK are not implemented for MSI/X,
> VFIO_IRQ_SET_ACTION_TRIGGER is the only one that is implemented. It
> mostly comes down to it hasn't been needed. We would just need to
> verify that disable_irq() and enable_irq() do the right thing and
> enable the features through the existing VFIO API. The reason we have
> these for INTx is that INTx is level triggered and we must mask the
> interrupt on the host while it's serviced by the user because the
> interrupt will continue to assert and we have no service guarantees
> from the user. MSI/X is of course edge triggered, so doesn't have this
> problem. The device would need to continue to re-assert the
> interrupt. In cases like a QEMU user, we might want to continue to
> receive guest masked interrupts to QEMU for emulating the pending bits
> array, so even if we had VFIO masking support for these interrupts, QEMU
> probably would not use it. If MSI/X masking is useful for you vs simply
> ignoring the eventfd, patches welcome.
>
>> Another question since MSIX interrupt masking is absent in kernel who is
>> supposed to mask/unmask msix interrupt? does userspace needs to mask/unmask
>> interrupt after getting notified by eventfd. Is there any usecase where
>> MSIX handling is done via VFIO framework?
>
> MSI/X is edge triggered and VFIO does not set the AUTOMASKED flag on
> them, so there is simply no masking of them at all aside from fully
> disabling them. QEMU has full support for MSI/X through VFIO that you
> can use for reference. Thanks,
>
> Alex
* Fwd: VFIO MSIX Query
[not found] ` <CAB995MsFjv=GvHSmo029fmuTWhGiYev3-2A358orFz_dq724Fw@mail.gmail.com>
@ 2016-05-07 15:50 ` Nitin Saxena
2016-05-08 1:50 ` Alex Williamson
1 sibling, 0 replies; 6+ messages in thread
From: Nitin Saxena @ 2016-05-07 15:50 UTC (permalink / raw)
To: Alex Williamson, linux-arm-kernel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 4447 bytes --]
Hi,
I have specific questions related to the problem I am facing in using
VFIO for MSI-X interrupts.

My userspace program, running on x86, talks to a NIC connected over a
PCIe interface. The current program works fine in poll mode, with no
VF/SR-IOV requirement. I am changing my app from poll mode to
interrupt mode using the VFIO no-IOMMU feature, as I am only
interested in getting MSI-X interrupts (for this I have ported the
VFIO no-IOMMU feature patch from kernel 4.5 to 3.10 SL7). After
applying the patch I can see /dev/vfio/noiommu-X and I could assign my
PCI device to /dev/vfio/noiommu-0, so the kernel patch ported from 4.5
to 3.10 seems to be working fine. MSI-X interrupts also work fine with
another kernel application, but I am facing issues using MSI-X
interrupts from my userspace program integrated with VFIO. I am
running a single process with only one MSI-X interrupt requirement.
A few questions:

Q1) Why does the VFIO_GROUP_GET_DEVICE_FD ioctl call stop poll-mode
data traffic?
Using my poll-mode data path loop (where I am not waiting for MSI-X)
together with my vfio_init() API, I found that if vfio_init() performs
the VFIO sequence
(a) create container fd, (b) get group_fd, (c) VFIO_SET_IOMMU as
NOIOMMU ==> then my poll-mode data path works fine,
but if it performs (a) create container fd, (b) get group_fd,
(c) VFIO_SET_IOMMU as NOIOMMU, (d) VFIO_GROUP_GET_DEVICE_FD ==> then
my poll-mode data path does not work. I don't know why.
Q2) Is mapping the MSI-X BAR/table required in my case? Why?
Even when VFIO is not used in my userspace program, BAR0 and BAR1 are
already mmapped into the process, but not the MSI-X table. Do I need
to mmap the MSI-X table in my interrupt-based data path? I saw this
mmapping of the MSI-X table in both QEMU and DPDK, but I think that
software must be using VFIO_IOMMU_TYPE1 rather than VFIO_NOIOMMU,
which is my case. So can you confirm whether I need to mmap the MSI-X
table?
Q3) Is VFIO_DEVICE_RESET required? When and why?
My userspace program implicitly resets the PCI hardware, then I call
my vfio_init() API and start the data path. So at what point should I
call VFIO_DEVICE_RESET?

Q4) Is there any do's and don'ts document for VFIO, so that I can
follow the proper sequence?
I am attaching the kernel patch that I created to backport VFIO
no-IOMMU support from kernel 4.5 to 3.10.
Thanks in advance,
Nitin
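
For reference, the sequence described in Q1 corresponds roughly to the
following sketch; the group path, PCI address, and minimal error
handling are placeholders, and VFIO_NOIOMMU_IOMMU comes from the
no-IOMMU patch attached below.

/* Sketch of the container/group/device sequence from Q1 (no-IOMMU mode). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_init_noiommu(void)
{
	struct vfio_group_status status = { .argsz = sizeof(status) };
	int container, group, device;

	container = open("/dev/vfio/vfio", O_RDWR);           /* (a) container fd */
	group = open("/dev/vfio/noiommu-0", O_RDWR);          /* (b) group fd */
	if (container < 0 || group < 0)
		return -1;

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_NOIOMMU_IOMMU); /* (c) no-IOMMU backend */

	/* (d) first open of the device fd; per Alex's reply below, VFIO
	 * runs a device initialization sequence here, including a reset. */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");
	return device;
}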
On Wed, May 4, 2016 at 1:32 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> Hi,
>
> On Tue, 3 May 2016 23:56:38 +0530
> Nitin Saxena <nitin.lnx@gmail.com> wrote:
>
> > Hi,
> >
> > I am a newbie to VFIO framework and trying to use it for MSIX interrupt
> > handling in my userspace application. My userspace application is like
> > intel's dpdk.
> >
> > My query is why vfio kernel code does not support msix masking/ unmasking
> > i.e VFIO_SET_ACTION_TRIGGER is not implemented in kernel vfio code for MSIX
> > but its present for legacy INTX?
>
> I think you mean VFIO_IRQ_SET_ACTION_MASK and
> VFIO_IRQ_SET_ACTION_UNMASK are not implemented for MSI/X,
> VFIO_IRQ_SET_ACTION_TRIGGER is the only one that is implemented. It
> mostly comes down to it hasn't been needed. We would just need to
> verify that disable_irq() and enable_irq() do the right thing and
> enable the features through the existing VFIO API. The reason we have
> these for INTx is that INTx is level triggered and we must mask the
> interrupt on the host while it's serviced by the user because the
> interrupt will continue to assert and we have no service guarantees
> from the user. MSI/X is of course edge triggered, so doesn't have this
> problem. The device would need to continue to re-assert the
> interrupt. In cases like a QEMU user, we might want to continue to
> receive guest masked interrupts to QEMU for emulating the pending bits
> array, so even if we had VFIO masking support for these interrupts, QEMU
> probably would not use it. If MSI/X masking is useful for you vs simply
> ignoring the eventfd, patches welcome.
>
> > Another question since MSIX interrupt masking is absent in kernel who is
> > supposed to mask/unmask msix interrupt? does userspace needs to mask/unmask
> > interrupt after getting notified by eventfd. Is there any usecase where
> > MSIX handling is done via VFIO framework?
>
> MSI/X is edge triggered and VFIO does not set the AUTOMASKED flag on
> them, so there is simply no masking of them at all aside from fully
> disabling them. QEMU has full support for MSI/X through VFIO that you
> can use for reference. Thanks,
>
> Alex
[-- Attachment #2: vfio_noiommu_changes-3.10-123.0-SL7.0.patch --]
[-- Type: text/x-patch, Size: 14785 bytes --]
diff -purN linux-3.10.0-123.el7/drivers/vfio/Kconfig linux-3.10.0-123.el7_vfio_changes/drivers/vfio/Kconfig
--- linux-3.10.0-123.el7/drivers/vfio/Kconfig 2014-05-05 20:28:32.000000000 +0530
+++ linux-3.10.0-123.el7_vfio_changes/drivers/vfio/Kconfig 2016-04-25 18:27:42.402962747 +0530
@@ -13,4 +13,20 @@ menuconfig VFIO
If you don't know what to do here, say N.
+menuconfig VFIO_NOIOMMU
+ bool "VFIO No-IOMMU support"
+ depends on VFIO
+ help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure. VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode. Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
+
source "drivers/vfio/pci/Kconfig"
diff -purN linux-3.10.0-123.el7/drivers/vfio/pci/vfio_pci.c linux-3.10.0-123.el7_vfio_changes/drivers/vfio/pci/vfio_pci.c
--- linux-3.10.0-123.el7/drivers/vfio/pci/vfio_pci.c 2014-05-05 20:28:32.000000000 +0530
+++ linux-3.10.0-123.el7_vfio_changes/drivers/vfio/pci/vfio_pci.c 2016-04-25 19:09:57.231300944 +0530
@@ -820,13 +820,13 @@ static int vfio_pci_probe(struct pci_dev
if ((type & PCI_HEADER_TYPE) != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;
- group = iommu_group_get(&pdev->dev);
+ group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
- iommu_group_put(group);
+ vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}
@@ -838,7 +838,7 @@ static int vfio_pci_probe(struct pci_dev
ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
- iommu_group_put(group);
+ vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
}
@@ -853,7 +853,7 @@ static void vfio_pci_remove(struct pci_d
if (!vdev)
return;
- iommu_group_put(pdev->dev.iommu_group);
+ vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev);
}
diff -purN linux-3.10.0-123.el7/drivers/vfio/vfio.c linux-3.10.0-123.el7_vfio_changes/drivers/vfio/vfio.c
--- linux-3.10.0-123.el7/drivers/vfio/vfio.c 2014-05-05 20:28:32.000000000 +0530
+++ linux-3.10.0-123.el7_vfio_changes/drivers/vfio/vfio.c 2016-04-25 19:04:41.209365878 +0530
@@ -61,6 +61,7 @@ struct vfio_container {
struct rw_semaphore group_lock;
struct vfio_iommu_driver *iommu_driver;
void *iommu_data;
+ bool noiommu;
};
struct vfio_group {
@@ -76,6 +77,7 @@ struct vfio_group {
struct list_head vfio_next;
struct list_head container_next;
atomic_t opened;
+ bool noiommu;
};
struct vfio_device {
@@ -87,6 +89,130 @@ struct vfio_device {
void *device_data;
};
+#ifdef CONFIG_VFIO_NOIOMMU
+static bool noiommu __read_mostly;
+module_param_named(enable_unsafe_noiommu_mode,
+ noiommu, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(enable_unsafe_noiommu_mode, "Enable UNSAFE, no-IOMMU mode. This mode provides no device isolation, no DMA translation, no host kernel protection, cannot be used for device assignment to virtual machines, requires RAWIO permissions, and will taint the kernel. If you do not know what this is for, step away. (default: false)");
+#endif
+
+/*
+ * vfio_iommu_group_{get,put} are only intended for VFIO bus driver probe
+ * and remove functions, any use cases other than acquiring the first
+ * reference for the purpose of calling vfio_add_group_dev() or removing
+ * that symmetric reference after vfio_del_group_dev() should use the raw
+ * iommu_group_{get,put} functions. In particular, vfio_iommu_group_put()
+ * removes the device from the dummy group and cannot be nested.
+ */
+struct iommu_group *vfio_iommu_group_get(struct device *dev)
+{
+ struct iommu_group *group;
+ int __maybe_unused ret;
+
+ group = iommu_group_get(dev);
+
+ #ifdef CONFIG_VFIO_NOIOMMU
+ /*
+ * With noiommu enabled, an IOMMU group will be created for a device
+ * that doesn't already have one and doesn't have an iommu_ops on their
+ * bus. We use iommu_present() again in the main code to detect these
+ * fake groups.
+ */
+ if (group || !noiommu || iommu_present(dev->bus))
+ return group;
+
+ group = iommu_group_alloc();
+ if (IS_ERR(group))
+ return NULL;
+
+ iommu_group_set_name(group, "vfio-noiommu");
+ ret = iommu_group_add_device(group, dev);
+ iommu_group_put(group);
+ if (ret)
+ return NULL;
+
+ /*
+ * Where to taint? At this point we've added an IOMMU group for a
+ * device that is not backed by iommu_ops, therefore any iommu_
+ * callback using iommu_ops can legitimately Oops. So, while we may
+ * be about to give a DMA capable device to a user without IOMMU
+ * protection, which is clearly taint-worthy, let's go ahead and do
+ * it here.
+ */
+ add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+ dev_warn(dev, "Adding kernel taint for vfio-noiommu group on device\n");
+ #endif
+
+ return group;
+ }
+ EXPORT_SYMBOL_GPL(vfio_iommu_group_get);
+
+void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
+ {
+ #ifdef CONFIG_VFIO_NOIOMMU
+ if (!iommu_present(dev->bus))
+ iommu_group_remove_device(dev);
+ #endif
+
+ iommu_group_put(group);
+ }
+ EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
+
+ #ifdef CONFIG_VFIO_NOIOMMU
+static void *vfio_noiommu_open(unsigned long arg)
+ {
+ if (arg != VFIO_NOIOMMU_IOMMU)
+ return ERR_PTR(-EINVAL);
+ if (!capable(CAP_SYS_RAWIO))
+ return ERR_PTR(-EPERM);
+
+ return NULL;
+ }
+
+static void vfio_noiommu_release(void *iommu_data)
+ {
+ }
+
+static long vfio_noiommu_ioctl(void *iommu_data,
+ unsigned int cmd, unsigned long arg)
+ {
+ if (cmd == VFIO_CHECK_EXTENSION)
+ return noiommu && (arg == VFIO_NOIOMMU_IOMMU) ? 1 : 0;
+
+ return -ENOTTY;
+ }
+
+static int vfio_iommu_present(struct device *dev, void *unused)
+ {
+ return iommu_present(dev->bus) ? 1 : 0;
+ }
+
+static int vfio_noiommu_attach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+ {
+ return iommu_group_for_each_dev(iommu_group, NULL,
+ vfio_iommu_present) ? -EINVAL : 0;
+ }
+
+static void vfio_noiommu_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+ {
+ }
+
+ static const struct vfio_iommu_driver_ops vfio_noiommu_ops = {
+ .name = "vfio-noiommu",
+ .owner = THIS_MODULE,
+ .open = vfio_noiommu_open,
+ .release = vfio_noiommu_release,
+ .ioctl = vfio_noiommu_ioctl,
+ .attach_group = vfio_noiommu_attach_group,
+ .detach_group = vfio_noiommu_detach_group,
+ };
+#endif
+
+
+
+
/**
* IOMMU driver registration
*/
@@ -191,7 +317,8 @@ static void vfio_group_unlock_and_free(s
/**
* Group objects - create, release, get, put, search
*/
-static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
+static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
+ bool iommu_present)
{
struct vfio_group *group, *tmp;
struct device *dev;
@@ -207,6 +334,8 @@ static struct vfio_group *vfio_create_gr
atomic_set(&group->container_users, 0);
atomic_set(&group->opened, 0);
group->iommu_group = iommu_group;
+ group->noiommu = !iommu_present;
+
group->nb.notifier_call = vfio_iommu_group_notifier;
@@ -243,7 +372,8 @@ static struct vfio_group *vfio_create_gr
dev = device_create(vfio.class, NULL,
MKDEV(MAJOR(vfio.group_devt), minor),
- group, "%d", iommu_group_id(iommu_group));
+ group, "%s%d", group->noiommu ? "no-iommu-" : "",
+ iommu_group_id(iommu_group));
if (IS_ERR(dev)) {
vfio_free_group_minor(minor);
vfio_group_unlock_and_free(group);
@@ -579,7 +709,7 @@ int vfio_add_group_dev(struct device *de
group = vfio_group_get_from_iommu(iommu_group);
if (!group) {
- group = vfio_create_group(iommu_group);
+ group = vfio_create_group(iommu_group, iommu_present(dev->bus));
if (IS_ERR(group)) {
iommu_group_put(iommu_group);
return PTR_ERR(group);
@@ -709,6 +839,12 @@ static long vfio_ioctl_check_extension(s
mutex_lock(&vfio.iommu_drivers_lock);
list_for_each_entry(driver, &vfio.iommu_drivers_list,
vfio_next) {
+#ifdef CONFIG_VFIO_NOIOMMU
+ if (!list_empty(&container->group_list) &&
+ (container->noiommu !=
+ (driver->ops == &vfio_noiommu_ops)))
+ continue;
+#endif
if (!try_module_get(driver->ops->owner))
continue;
@@ -779,6 +915,14 @@ static long vfio_ioctl_set_iommu(struct
mutex_lock(&vfio.iommu_drivers_lock);
list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
void *data;
+#ifdef CONFIG_VFIO_NOIOMMU
+ /*
+ * Only noiommu containers can use vfio-noiommu and noiommu
+ * containers can only use vfio-noiommu.
+ */
+ if (container->noiommu != (driver->ops == &vfio_noiommu_ops))
+ continue;
+#endif
if (!try_module_get(driver->ops->owner))
continue;
@@ -1042,6 +1186,9 @@ static int vfio_group_set_container(stru
if (atomic_read(&group->container_users))
return -EINVAL;
+ if (group->noiommu && !capable(CAP_SYS_RAWIO))
+ return -EPERM;
+
f = fdget(container_fd);
if (!f.file)
return -EBADF;
@@ -1057,6 +1204,13 @@ static int vfio_group_set_container(stru
down_write(&container->group_lock);
+ /* Real groups and fake groups cannot mix */
+ if (!list_empty(&container->group_list) &&
+ container->noiommu != group->noiommu) {
+ ret = -EPERM;
+ goto unlock_out;
+ }
+
driver = container->iommu_driver;
if (driver) {
ret = driver->ops->attach_group(container->iommu_data,
@@ -1066,6 +1220,7 @@ static int vfio_group_set_container(stru
}
group->container = container;
+ container->noiommu = group->noiommu;
list_add(&group->container_next, &container->group_list);
/* Get a reference on the container and mark a user within the group */
@@ -1096,7 +1251,10 @@ static int vfio_group_get_device_fd(stru
!group->container->iommu_driver || !vfio_group_viable(group))
return -EINVAL;
- mutex_lock(&group->device_lock);
+ if (group->noiommu && !capable(CAP_SYS_RAWIO))
+ return -EPERM;
+
+ mutex_lock(&group->device_lock);
list_for_each_entry(device, &group->device_list, group_next) {
if (strcmp(dev_name(device->dev), buf))
continue;
@@ -1134,6 +1292,11 @@ static int vfio_group_get_device_fd(stru
atomic_inc(&group->container_users);
fd_install(ret, filep);
+ if (group->noiommu){
+ dev_warn(device->dev, "vfio-noiommu device opened by user "
+ "(%s:%d)\n", current->comm, task_pid_nr(current));
+ }
+
break;
}
mutex_unlock(&group->device_lock);
@@ -1226,6 +1389,11 @@ static int vfio_group_fops_open(struct i
if (!group)
return -ENODEV;
+ if (group->noiommu && !capable(CAP_SYS_RAWIO)) {
+ vfio_group_put(group);
+ return -EPERM;
+ }
+
/* Do we need multiple instances of the group open? Seems not. */
opened = atomic_cmpxchg(&group->opened, 0, 1);
if (opened) {
@@ -1388,6 +1556,11 @@ struct vfio_group *vfio_group_get_extern
if (!atomic_inc_not_zero(&group->container_users))
return ERR_PTR(-EINVAL);
+ if (group->noiommu) {
+ atomic_dec(&group->container_users);
+ return ERR_PTR(-EPERM);
+ }
+
if (!group->container->iommu_driver ||
!vfio_group_viable(group)) {
atomic_dec(&group->container_users);
@@ -1472,6 +1645,9 @@ static int __init vfio_init(void)
* drivers.
*/
request_module_nowait("vfio_iommu_type1");
+#ifdef CONFIG_VFIO_NOIOMMU
+ vfio_register_iommu_driver(&vfio_noiommu_ops);
+#endif
return 0;
@@ -1489,6 +1665,9 @@ static void __exit vfio_cleanup(void)
{
WARN_ON(!list_empty(&vfio.group_list));
+#ifdef CONFIG_VFIO_NOIOMMU
+ vfio_unregister_iommu_driver(&vfio_noiommu_ops);
+#endif
idr_destroy(&vfio.group_idr);
cdev_del(&vfio.group_cdev);
unregister_chrdev_region(vfio.group_devt, MINORMASK);
diff -purN linux-3.10.0-123.el7/include/linux/vfio.h linux-3.10.0-123.el7_vfio_changes/include/linux/vfio.h
--- linux-3.10.0-123.el7/include/linux/vfio.h 2014-05-05 20:28:32.000000000 +0530
+++ linux-3.10.0-123.el7_vfio_changes/include/linux/vfio.h 2016-04-25 19:05:52.733123276 +0530
@@ -40,6 +40,8 @@ struct vfio_device_ops {
int (*mmap)(void *device_data, struct vm_area_struct *vma);
};
+extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
+extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
extern int vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops,
void *device_data);
diff -purN linux-3.10.0-123.el7/include/uapi/linux/vfio.h linux-3.10.0-123.el7_vfio_changes/include/uapi/linux/vfio.h
--- linux-3.10.0-123.el7/include/uapi/linux/vfio.h 2014-05-05 20:28:32.000000000 +0530
+++ linux-3.10.0-123.el7_vfio_changes/include/uapi/linux/vfio.h 2016-04-25 19:06:59.765896771 +0530
@@ -22,7 +22,12 @@
/* Extensions */
#define VFIO_TYPE1_IOMMU 1
-
+/*
+ * * The No-IOMMU IOMMU offers no translation or isolation for devices and
+ * * supports no ioctls outside of VFIO_CHECK_EXTENSION. Use of VFIO's No-IOMMU
+ * * code will taint the host kernel and should be used with extreme caution.
+ * */
+#define VFIO_NOIOMMU_IOMMU 8
/*
* The IOCTL interface is designed for extensibility by embedding the
* structure length (argsz) and flags into structures passed between
* Re: VFIO MSIX Query
[not found] ` <CAB995MsFjv=GvHSmo029fmuTWhGiYev3-2A358orFz_dq724Fw@mail.gmail.com>
2016-05-07 15:50 ` Fwd: " Nitin Saxena
@ 2016-05-08 1:50 ` Alex Williamson
[not found] ` <CAB995MvxU+mN9uxr_20nLkoezyhedUAHdFWawk-GG_bj0YjSYw@mail.gmail.com>
1 sibling, 1 reply; 6+ messages in thread
From: Alex Williamson @ 2016-05-08 1:50 UTC (permalink / raw)
To: Nitin Saxena; +Cc: linux-kernel, linux-arm-kernel@lists.infradead.org
Hi Nitin,
On Sat, 7 May 2016 21:09:09 +0530
Nitin Saxena <nitin.lnx@gmail.com> wrote:
> Hi Alex,
>
> I have specific questions related to the problem I am facing in using VFIO
> for MSIX interrupt.
>
> My userspace program running on x86 talks to NIC connected via PCIe
> interface. Current program works fine in poll mode with no VF and SRIOV
> requirement. I am changing my app from poll mode to interrupt mode using
> VFIO NOIOMMU feature as I am interested in getting MSIX interrupts only
> (For this I have ported VFIO-NOIOMMU feature patch from kernel 4.5 to 3.10
> SL7). After applying the patch I can see /dev/vfio/noiommu-X and I could
> assign my PCI device to /dev/vfio/noiommu-0. Seems like kernel patch
> applied from 4.5 to 3.10 working fine. Also MSIX interrupts works fine with
> another kernel application but I am facing issues in using MSIX interrupts
> with my userspace program integrated with VFIO. I am running single process
> only with one MSIX interrupt requirement
>
> Few Questions:
>
> Q1) VFIO_GROUP_GET_DEVICE_FD ioctl call stops poll mode data traffic. Why?
> I used my poll mode data path loop (where I am not waiting for MSIX) with
> my vfio_init() API I found that if I call follow VFIO sequence in
> vfio_init():
> a) create container fd (b) get group_fd (c) VFIO_SET_IOMMU as NOIOMMU ==>
> then my poll mode data path works fine
> but if I call a) create container fd (b) get group_fd (c) VFIO_SET_IOMMU as
> NOIOMMU (d) VFIO_GROUP_GET_DEVICE_FD ==> then my poll mode data path does
> not work. I don't know why
It sounds like you're trying to mix VFIO with some other access to the
device, perhaps through pci-sysfs. This is not compatible with VFIO.
Use VFIO for all access to the device or none. When the device is
initially opened via the GET_DEVICE_FD ioctl, we go through an
initialization sequence on the device, including a device reset. VFIO
is not meant to only provide MSI-X on top of some other access
mechanism.
> Q2)Does MSIX bar/table mapping required for my case. Why?
> Even if VFIO is not used in my userspace program, BAR0 and BAR1 was already
> mmapped to process but not MSIX table. Do I need to mmap MSIX table in my
> interrupt based data path as I saw this mmaping of MSIX table in QEMU and
> dpdk both but I think these software must be using VFIO_IOMMU_TYPE1 and not
> VFIO_NOIOMMU which is my case. So can you confirm is it needed for me to
> mmap MSIX table.
Again, no mixing interfaces. Use VFIO for all access to the device or
don't use VFIO at all. Regardless, there's no need to mmap the MSI-X
table. VFIO does not currently allow mmap of the MSI-X table. If at
some point it is allowed, attempting to manipulate MSI-X via the device
vector table directly is not supported. Only the SET_IRQS ioctl is
used for configuring device interrupts.
> Q3) Does VFIO_DEVICE_RESET required, when and why ?
> My userspace program implicitly resets PCI hardware and then I call my
> vfio_init() API and then start data path. So at what point I should can
> VFIO_DEVICE_RESET?
This is wrong, VFIO is the access to the device, any combination of
other access mechanisms plus VFIO is not supported and not intended to
work. VFIO will attempt to reset the device when opened, any other
reset is at the user driver's discretion.
> Q4) Is there any Do's and Dont's document for VFIO so that I can follow
> proper sequence.
Don't #1: Don't attempt to add VFIO only for MSI-X support, use it for
all access to the device or don't use it at all. There's
Documentation/vfio.txt and QEMU as the de facto standard usage example,
that's it afaik. Thanks,
Alex
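
As a concrete illustration of the "all access through VFIO" point
above, BAR access also goes through the VFIO device fd rather than
pci-sysfs. A minimal sketch, with the BAR index and error handling as
assumptions:

/* Sketch: mmap BAR0 through the VFIO device fd instead of pci-sysfs. */
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static void *map_bar0(int device_fd)
{
	struct vfio_region_info reg = {
		.argsz = sizeof(reg),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};
	void *bar;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
		return NULL;
	if (!(reg.flags & VFIO_REGION_INFO_FLAG_MMAP))
		return NULL;  /* region only reachable via pread()/pwrite() at reg.offset */

	bar = mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, device_fd, reg.offset);
	return bar == MAP_FAILED ? NULL : bar;
}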
* Re: VFIO MSIX Query
[not found] ` <CAB995MvxU+mN9uxr_20nLkoezyhedUAHdFWawk-GG_bj0YjSYw@mail.gmail.com>
@ 2016-05-09 15:20 ` Alex Williamson
0 siblings, 0 replies; 6+ messages in thread
From: Alex Williamson @ 2016-05-09 15:20 UTC (permalink / raw)
To: Nitin Saxena
Cc: linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Sun, 8 May 2016 20:45:10 +0530
Nitin Saxena <nitin.lnx@gmail.com> wrote:
> Hi Alex,
>
> Thanks for your answer. I got the point that all access to device must be
> VFIO.
> I have few more questions.
>
> >> When the device is initially opened via the VFIO_GET_DEVICE_FD ioctl, we
> go through an initialization sequence on the device, including a device
> reset
> Does this mean only one process (the primary in a multi-process app) can
> perform the VFIO_GROUP_GET_DEVICE_FD ioctl? I doubt this, as I saw in DPDK
> that each process (primary or not) does the VFIO_GROUP_GET_DEVICE_FD
> ioctl. Kindly confirm.
Users can call GET_DEVICE_FD multiple times, but note that the group
file descriptor can only be opened once, so you'd need to pass
device file descriptors off to other threads.
> >> Use VFIO for all access to the device or don't use VFIO at all.
> I will map PCI BARS via VFIO but what about hugepages for each process.
> Since I am using VFIO_NOIOMMU do I need to call VFIO_IOMMU_MAP_DMA ioctl
> for each hugepage of each process? I think this is not needed in my case.
> Please confirm.
No-IOMMU mode does not have MAP_DMA or UNMAP_DMA ioctls. If you want
to support a safe operating mode, with an IOMMU providing DMA
isolation, then yes, you would need to map DMA targets for the device
using those ioctls for the type1 IOMMU backend.
> Can you please also confirm the patch I attached. Is there anything
> missing in the patch that I need to incorporate from kernel 4.5 to make
> VFIO NOIOMMU work?
I can't confirm or deny whether the backport is accurate. Thanks,
Alex
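
For the IOMMU-backed (type1) path mentioned above, DMA targets are
registered through the container fd. A minimal sketch, with the
buffer, IOVA, and error handling left as placeholders (no-IOMMU mode
has no MAP_DMA/UNMAP_DMA ioctls, as noted):

/* Sketch: register a DMA buffer with the type1 IOMMU backend. */
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_dma_buffer(int container_fd, void *vaddr, size_t size,
			  uint64_t iova)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)vaddr,  /* e.g. a hugepage-backed buffer */
		.iova  = iova,              /* address the device will use for DMA */
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}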