RFC: Device isolation infrastructure

All of lore.kernel.org
 help / color / mirror / Atom feed

* RFC: Device isolation infrastructure
@ 2011-12-07  3:58 David Gibson
  2011-12-07  6:22 ` Benjamin Herrenschmidt
  2011-12-07  6:46 ` Alex Williamson
  0 siblings, 2 replies; 16+ messages in thread
From: David Gibson @ 2011-12-07  3:58 UTC (permalink / raw)
  To: Alex Williamson, joerg.roedel, dwmw2, iommu, aik
  Cc: linux-kernel, chrisw, agraf, scottwood, B08248, benh


Alex Williamson recently posted some patches adding a new hook to
iommu_ops to identify groups of devices which cannot reliably be
distinguished by the iommu, and therefore can only safely be assigned
as a unit to a guest or userspace driver.

I'm not very fond of that design, because:
    - It requires something else to scan the group ids to present the
groups in a userspace discoverable manner, rather than handling that
in the core.
    - It encorages manipulating devices individually, but only
permitting certain actions if the whole group is together.  This seems
confusing and arse-backwards to me.
    - It does not provide an obvious means to prevent a kernel
driver unsafely binding to a new device which is hotplugged into an
iommu group already assigned to a guest.
    - It is inherently centred arounf the iommu structure, however
there might be other platform dependent factors which also affect
which devices can be safely isolated from others.


These could probably each be addressed with various tweaks, but
basically, I'm just not convinced that it represents the right
structure for handling this.  So, Alexey Kardashevskiy and myself havr
made this patches, representing the outline of how I think this should
be handled.


Notes:
    * All names are negotiable.  At the moment I'm thinking of this as
the "device isolation subsystem" which handles "device isolation
groups" (not to be confused with an iommu domain, which could contain
one or more groups).  A group is considered "detached" when it has
been removed from the normal driver binding process, and therefore
ready for assignment to a guest or userspace.  I'm more than open to
suggestions of better names.
    * This is just the generic infrastructure.  Obviously, platform
specific code will be needed as well to create the groups and assign
devices to them.  We have a proof of concept implementation for the
power p5ioc2 pci host bridge, and I hope to post code for that and the
p7ioc bridge shortly.  I'll also be looking at how this would be used
by Intel VTd, and AMD iommus.
    * isolation groups appear in /sys/isolation.  Isolatable devices
have a link to their group.  The group contains links to all their
devices, and a "detached" attribute, which can be used to view and
control whether the group is detached.
    * At present groups are detached manually via sysfs, but the idea
is a kernel system like vfio, could also directly request a group to
be detached for its use.
    * when detached, all kernel drivers are unbound from every device
in the group.  No driver will match any device in the group until it
is reattached.

    * There are probably stupid bugs, in particular I know some error
paths are broken.  By all means point these out to expedite fixing
them.
    * The isolation_group structure is more or less a stub at this
point.  It will probably want a pointer to an iommu_ops and/or its own
ops structure.  The 'detached' bool may need to become some sort of
state variable.  We'll probably want some sort of "owner" field for
what (e.g. vfio) is using the group when its detached.
    * I've also considered adding an "isolation strength" bitmask to
indicate how well isolated the group is - e.g DMA isolated (iommu),
irq isolated (irq remapping iommu, or other mechanism), error isolated
(bus errors from this device can't affect other devices), ..

Anyway, on with the code..

From: David Gibson <david@gibson.dropbear.id.au>
To: Alex Williamson <alex.williamson@redhat.com>,
	joerg.roedel@amd.com, dwmw2@infradead.org,
	iommu@lists.linux-foundation.org,
	aik@au1.ibm.com
Cc: linux-kernel@vger.kernel.org, chrisw@redhat.com,
    agraf@suse.de, scottwood@freescale.com, B08248@freescale.com,
    benh@kernel.crashing.org
Bcc: 
Subject: RFC: Device isolation infrastructure
Reply-To: 


Alex Williamson recently posted some patches adding a new hook to
iommu_ops to identify groups of devices which cannot reliably be
distinguished by the iommu, and therefore can only safely be assigned
as a unit to a guest or userspace driver.

I'm not very fond of that design, because:
    - It requires something else to scan the group ids to present the
groups in a userspace discoverable manner, rather than handling that
in the core.
    - It encorages manipulating devices individually, but only
permitting certain actions if the whole group is together.  This seems
confusing and arse-backwards to me.
    - It does not provide an obvious means to prevent a kernel
driver unsafely binding to a new device which is hotplugged into an
iommu group already assigned to a guest.
    - It is inherently centred arounf the iommu structure, however
there might be other platform dependent factors which also affect
which devices can be safely isolated from others.


These could probably each be addressed with various tweaks, but
basically, I'm just not convinced that it represents the right
structure for handling this.  So, Alexey Kardashevskiy and myself havr
made this patches, representing the outline of how I think this should
be handled.


Notes:
    * All names are negotiable.  At the moment I'm thinking of this as
the "device isolation subsystem" which handles "device isolation
groups" (not to be confused with an iommu domain, which could contain
one or more groups).  A group is considered "detached" when it has
been removed from the normal driver binding process, and therefore
ready for assignment to a guest or userspace.  I'm more than open to
suggestions of better names.
    * This is just the generic infrastructure.  Obviously, platform
specific code will be needed as well to create the groups and assign
devices to them.  We have a proof of concept implementation for the
power p5ioc2 pci host bridge, and I hope to post code for that and the
p7ioc bridge shortly.  I'll also be looking at how this would be used
by Intel VTd, and AMD iommus.
    * isolation groups appear in /sys/isolation.  Isolatable devices
have a link to their group.  The group contains links to all their
devices, and a "detached" attribute, which can be used to view and
control whether the group is detached.
    * At present groups are detached manually via sysfs, but the idea
is a kernel system like vfio, could also directly request a group to
be detached for its use.
    * when detached, all kernel drivers are unbound from every device
in the group.  No driver will match any device in the group until it
is reattached.

    * There are probably stupid bugs, in particular I know some error
paths are broken.  By all means point these out to expedite fixing
them.
    * The isolation_group structure is more or less a stub at this
point.  It will probably want a pointer to an iommu_ops and/or its own
ops structure.  The 'detached' bool may need to become some sort of
state variable.  We'll probably want some sort of "owner" field for
what (e.g. vfio) is using the group when its detached.
    * I've also considered adding an "isolation strength" bitmask to
indicate how well isolated the group is - e.g DMA isolated (iommu),
irq isolated (irq remapping iommu, or other mechanism), error isolated
(bus errors from this device can't affect other devices), ..

Anyway, on with the code..

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

---
 drivers/Kconfig                      |    2 +
 drivers/Makefile                     |    2 +
 drivers/base/base.h                  |    5 +
 drivers/base/core.c                  |    7 +
 drivers/base/init.c                  |    2 +
 drivers/isolation/Kconfig            |    3 +
 drivers/isolation/Makefile           |    2 +
 drivers/isolation/device_isolation.c |  216 ++++++++++++++++++++++++++++++++++
 include/linux/device.h               |    5 +
 include/linux/device_isolation.h     |   79 ++++++++++++
 10 files changed, 323 insertions(+), 0 deletions(-)
 create mode 100644 drivers/isolation/Kconfig
 create mode 100644 drivers/isolation/Makefile
 create mode 100644 drivers/isolation/device_isolation.c
 create mode 100644 include/linux/device_isolation.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 95b9e7e..2ce5717 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
 
 source "drivers/virt/Kconfig"
 
+source "drivers/isolation/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 7fa433a..741874a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT)	+= iommu/
 
 # Virtualization drivers
 obj-$(CONFIG_VIRT_DRIVERS)	+= virt/
+
+obj-$(CONFIG_DEVICE_ISOLATION)	+= isolation/
diff --git a/drivers/base/base.h b/drivers/base/base.h
index a34dca0..66a4ed2 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -24,6 +24,9 @@
  * bus_type/class to be statically allocated safely.  Nothing outside of the
  * driver core should ever touch these fields.
  */
+
+#include <linux/device_isolation.h>
+
 struct subsys_private {
 	struct kset subsys;
 	struct kset *devices_kset;
@@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev);
 static inline int driver_match_device(struct device_driver *drv,
 				      struct device *dev)
 {
+	if (isolation_device_is_detached(dev))
+		return 0;
 	return drv->bus->match ? drv->bus->match(dev, drv) : 1;
 }
 
diff --git a/drivers/base/core.c b/drivers/base/core.c
index bc8729d..45201c5 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -22,6 +22,7 @@
 #include <linux/kallsyms.h>
 #include <linux/mutex.h>
 #include <linux/async.h>
+#include <linux/device_isolation.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -593,6 +594,10 @@ void device_initialize(struct device *dev)
 	lockdep_set_novalidate_class(&dev->mutex);
 	spin_lock_init(&dev->devres_lock);
 	INIT_LIST_HEAD(&dev->devres_head);
+#ifdef CONFIG_DEVICE_ISOLATION
+	INIT_LIST_HEAD(&dev->isolation_list);
+	dev->isolation_group = NULL;
+#endif
 	device_pm_init(dev);
 	set_dev_node(dev, -1);
 }
@@ -993,6 +998,8 @@ int device_add(struct device *dev)
 				class_intf->add_dev(dev, class_intf);
 		mutex_unlock(&dev->class->p->class_mutex);
 	}
+
+	isolation_group_update_links(dev);
 done:
 	put_device(dev);
 	return error;
diff --git a/drivers/base/init.c b/drivers/base/init.c
index c8a934e..89f3e94 100644
--- a/drivers/base/init.c
+++ b/drivers/base/init.c
@@ -8,6 +8,7 @@
 #include <linux/device.h>
 #include <linux/init.h>
 #include <linux/memory.h>
+#include <linux/device_isolation.h>
 
 #include "base.h"
 
@@ -24,6 +25,7 @@ void __init driver_init(void)
 	devices_init();
 	buses_init();
 	classes_init();
+	isolation_init();
 	firmware_init();
 	hypervisor_init();
 
diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig
new file mode 100644
index 0000000..a2a7703
--- /dev/null
+++ b/drivers/isolation/Kconfig
@@ -0,0 +1,3 @@
+config DEVICE_ISOLATION
+	bool "Enable isolating devices for safe pass-through to guests or user space."
+
diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile
new file mode 100644
index 0000000..8f1c6ec
--- /dev/null
+++ b/drivers/isolation/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o
+
diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c
new file mode 100644
index 0000000..2b00ff7
--- /dev/null
+++ b/drivers/isolation/device_isolation.c
@@ -0,0 +1,216 @@
+#include <linux/slab.h>
+#include <linux/device_isolation.h>
+
+#define KERN_DEBUG KERN_WARNING
+
+
+static struct kset *isolation_kset;
+
+#define to_isolation_attr(_attr) container_of(_attr, \
+		struct isolation_attribute, attr)
+
+static ssize_t isolation_attr_show(struct kobject *kobj,
+		struct attribute *attr, char *buf)
+{
+	struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
+	struct isolation_group *group = container_of(kobj,
+			struct isolation_group, kobj);
+	ssize_t ret = -EIO;
+
+	if (isolation_attr->show)
+		ret = isolation_attr->show(group, buf);
+	return ret;
+}
+
+static ssize_t isolation_attr_store(struct kobject *kobj,
+		struct attribute *attr, const char *buf, size_t count)
+{
+	struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
+	struct isolation_group *group = container_of(kobj,
+			struct isolation_group, kobj);
+	ssize_t ret = -EIO;
+
+	if (isolation_attr->store)
+		ret = isolation_attr->store(group, buf, count);
+	return ret;
+}
+
+static void isolation_release(struct kobject *kobj)
+{
+	struct isolation_group *group = container_of(kobj,
+			struct isolation_group, kobj);
+
+	pr_debug("isolation : release.\n");
+
+	kfree(group);
+}
+
+static const struct sysfs_ops isolation_sysfs_ops = {
+	.show	= isolation_attr_show,
+	.store	= isolation_attr_store,
+};
+
+static struct kobj_type isolation_ktype = {
+	.sysfs_ops	= &isolation_sysfs_ops,
+	.release	= isolation_release,
+};
+
+static ssize_t group_showdetached(struct isolation_group *group,
+		     char *buf)
+{
+	return sprintf(buf, "%u", !!group->detached);
+}
+
+static ssize_t group_setdetached(struct isolation_group *group,
+		     const char *buf, size_t count)
+{
+	switch (buf[0]) {
+	case '0':
+		isolation_group_reattach(group);
+		break;
+	case '1':
+		isolation_group_detach(group);
+		break;
+	default:
+		count = -EINVAL;
+	}
+	return count;
+}
+
+static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR,
+		group_showdetached, group_setdetached);
+
+struct isolation_group *isolation_group_new(const char *name)
+{
+	int ret;
+	struct isolation_group *group;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return NULL;
+
+	spin_lock_init(&group->lock);
+
+	group->kobj.kset = isolation_kset;
+	ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name);
+	if (ret < 0) {
+		printk(KERN_ERR "device_isolation: kobject_init_and_add failed "
+				"for %s\n", group->kobj.name);
+		return NULL;
+	}
+
+	ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr);
+	if (0 > ret)
+		printk(KERN_WARNING "device_isolation: create \"detached\" failed "
+			"for %s, errno=%i\n", group->kobj.name, ret);
+
+	INIT_LIST_HEAD(&group->devices);
+	printk(KERN_DEBUG "device_isolation: group %s created\n",
+		group->kobj.name);
+
+	return group;
+}
+
+void isolation_group_dev_add(struct isolation_group *group, struct device *dev)
+{
+	printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
+		dev->kobj.name, group->kobj.name);
+	spin_lock(&group->lock);
+
+	list_add_tail(&dev->isolation_list, &group->devices);
+	dev->isolation_group = group;
+	/* Todo: remove links we already created */
+	spin_unlock(&group->lock);
+}
+
+void isolation_group_dev_remove(struct device *dev)
+{
+	list_del(&dev->isolation_list);
+}
+
+int isolation_group_update_links(struct device *dev)
+{
+	int error;
+	char *buf;
+	struct isolation_group *group = dev->isolation_group;
+
+	if (!group) {
+		printk(KERN_DEBUG "device_isolation: no group for %s\n",
+				dev->kobj.name);
+		return 0;
+	}
+
+	printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
+		dev->kobj.name, group->kobj.name);
+
+	spin_lock(&group->lock);
+
+	error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group");
+	if (0 > error)
+		printk(KERN_WARNING "device_isolation: create isolation_group "
+			"link failed for %s -> %s, errno=%i\n",
+			dev->kobj.name, group->kobj.name, error);
+
+	buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev);
+	if (!buf) {
+		error = -ENOMEM;
+		printk(KERN_ERR "device_isolation: kasprintf failed\n");
+		goto out;
+	}
+
+	error = sysfs_create_link(&group->kobj, &dev->kobj, buf);
+	kfree(buf);
+	if (0 > error)
+		printk(KERN_WARNING "device_isolation: create %s "
+			"link failed for %s -> %s, errno=%i\n",
+			buf, dev->kobj.name, group->kobj.name, error);
+
+out:
+	/* Todo: remove links we already created */
+	spin_unlock(&group->lock);
+	return error;
+}
+
+int isolation_group_detach(struct isolation_group *group)
+{
+	int error;
+	struct device *dev;
+
+	spin_lock(&group->lock);
+
+	group->detached = true;
+	list_for_each_entry(dev, &group->devices, isolation_list) {
+		error = device_reprobe(dev);
+	}
+
+	spin_unlock(&group->lock);
+
+	return 0;
+}
+
+int isolation_group_reattach(struct isolation_group *group)
+{
+	int error;
+	struct device *dev;
+
+	spin_lock(&group->lock);
+
+	group->detached = false;
+	list_for_each_entry(dev, &group->devices, isolation_list) {
+		printk("Reprobing %s\n", dev->kobj.name);
+		error = device_reprobe(dev);
+	}
+
+	spin_unlock(&group->lock);
+
+	return 0;
+}
+
+int __init isolation_init(void)
+{
+	isolation_kset = kset_create_and_add("isolation", NULL, NULL);
+	if (!isolation_kset)
+		return -ENOMEM;
+	return 0;
+}
+
diff --git a/include/linux/device.h b/include/linux/device.h
index c20dfbf..6238a13 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -585,6 +585,11 @@ struct device {
 
 	struct dma_coherent_mem	*dma_mem; /* internal for coherent mem
 					     override */
+#ifdef CONFIG_DEVICE_ISOLATION
+	struct isolation_group  *isolation_group;
+	struct list_head 	isolation_list;
+#endif
+
 	/* arch specific additions */
 	struct dev_archdata	archdata;
 
diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h
new file mode 100644
index 0000000..1dc7e68
--- /dev/null
+++ b/include/linux/device_isolation.h
@@ -0,0 +1,79 @@
+#ifndef _DEVICE_ISOLATION_H_
+#define _DEVICE_ISOLATION_H_
+
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/device.h>
+
+struct isolation_group {
+	struct kobject 		kobj;
+	struct list_head 	devices;
+	spinlock_t		lock;
+	bool			detached;
+};
+
+struct isolation_attribute {
+	struct attribute	attr;
+	ssize_t (*show)(struct isolation_group *group, char *buf);
+	ssize_t (*store)(struct isolation_group *group, const char *buf,
+			size_t count);
+};
+
+#ifdef CONFIG_DEVICE_ISOLATION
+
+int __init isolation_init(void);
+
+struct isolation_group *isolation_group_new(const char *name);
+int isolation_group_delete(struct isolation_group *group);
+
+void isolation_group_dev_add(struct isolation_group *group,
+		struct device *dev);
+void isolation_group_dev_remove(struct device *dev);
+
+int isolation_group_update_links(struct device *dev);
+int isolation_group_detach(struct isolation_group *group);
+int isolation_group_reattach(struct isolation_group *group);
+
+#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store)	\
+	struct isolation_attribute isolation_attr_##_name =	\
+		__ATTR(_name, _mode, _show, _store)
+
+static inline int isolation_device_is_detached(struct device *dev)
+{
+	return dev->isolation_group && dev->isolation_group->detached;
+}
+
+#else
+
+static inline int __init isolation_init(void)
+{
+	return 0;
+}
+
+static inline struct isolation_group *isolation_group_new(const char *name)
+{
+	return NULL;
+}
+
+/*void isolation_group_dev_add(struct isolation_group *group,
+		struct device *dev)
+{
+}*/
+
+static inline void isolation_group_dev_remove(struct device *dev)
+{
+}
+
+static inline int isolation_group_update_links(struct device *dev)
+{
+	return 0;
+}
+
+static inline int isolation_is_detached(struct device *dev)
+{
+	return 0;
+}
+
+#endif
+
+#endif
-- 
1.7.7.3


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07  3:58 RFC: Device isolation infrastructure David Gibson
@ 2011-12-07  6:22 ` Benjamin Herrenschmidt
  2011-12-07  6:46 ` Alex Williamson
  1 sibling, 0 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2011-12-07  6:22 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, joerg.roedel, dwmw2, iommu, aik, linux-kernel,
	chrisw, agraf, scottwood, B08248

On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
> Alex Williamson recently posted some patches adding a new hook to
> iommu_ops to identify groups of devices which cannot reliably be
> distinguished by the iommu, and therefore can only safely be assigned
> as a unit to a guest or userspace driver.

Please ignore the numeros typos, David is typing all of that with a
broken hand in a cast... 

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07  3:58 RFC: Device isolation infrastructure David Gibson
  2011-12-07  6:22 ` Benjamin Herrenschmidt
@ 2011-12-07  6:46 ` Alex Williamson
  2011-12-07 14:22   ` David Gibson
  1 sibling, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2011-12-07  6:46 UTC (permalink / raw)
  To: David Gibson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
> 
> Alex Williamson recently posted some patches adding a new hook to
> iommu_ops to identify groups of devices which cannot reliably be
> distinguished by the iommu, and therefore can only safely be assigned
> as a unit to a guest or userspace driver.
> 
> I'm not very fond of that design, because:
>     - It requires something else to scan the group ids to present the
> groups in a userspace discoverable manner, rather than handling that
> in the core.

The thought was that the vast majority of users won't make use of
groups, so why bloat the core when the only user is vfio.  We need bus
specific drivers to expose devices in vfio, so it's fairly natural to
have the bus drivers enumerate devices on their bus, register them with
the vfio-core, which then exposes groups in a convenient manner for
userspace.

>     - It encorages manipulating devices individually, but only
> permitting certain actions if the whole group is together.  This seems
> confusing and arse-backwards to me.

Hmm, this patch introduces a group level detach and reattach of devices,
but seems to lack any support for actually making use of devices in the
detached state, so it's rather hard to compare.  How does someone
wanting to interact with a device within the group do so?  This appears
to put devices entirely outside of the driver model when detached, which
seems just as awkward.

>     - It does not provide an obvious means to prevent a kernel
> driver unsafely binding to a new device which is hotplugged into an
> iommu group already assigned to a guest.

I've added this for PCI, triggered via the device add notifier.

>     - It is inherently centred arounf the iommu structure, however
> there might be other platform dependent factors which also affect
> which devices can be safely isolated from others.

Yep, MSI isolation being an example.  I tend to think the iommu driver
is pretty closely tied to the platform (all current examples are), so
the iommu driver should take those factors into account when advertising
groups.

> 
> These could probably each be addressed with various tweaks, but
> basically, I'm just not convinced that it represents the right
> structure for handling this.  So, Alexey Kardashevskiy and myself havr
> made this patches, representing the outline of how I think this should
> be handled.
> 
> 
> Notes:
>     * All names are negotiable.  At the moment I'm thinking of this as
> the "device isolation subsystem" which handles "device isolation
> groups" (not to be confused with an iommu domain, which could contain
> one or more groups).  A group is considered "detached" when it has
> been removed from the normal driver binding process, and therefore
> ready for assignment to a guest or userspace.  I'm more than open to
> suggestions of better names.
>     * This is just the generic infrastructure.  Obviously, platform
> specific code will be needed as well to create the groups and assign
> devices to them.  We have a proof of concept implementation for the
> power p5ioc2 pci host bridge, and I hope to post code for that and the
> p7ioc bridge shortly.  I'll also be looking at how this would be used
> by Intel VTd, and AMD iommus.
>     * isolation groups appear in /sys/isolation.  Isolatable devices
> have a link to their group.  The group contains links to all their
> devices, and a "detached" attribute, which can be used to view and
> control whether the group is detached.
>     * At present groups are detached manually via sysfs, but the idea
> is a kernel system like vfio, could also directly request a group to
> be detached for its use.
>     * when detached, all kernel drivers are unbound from every device
> in the group.  No driver will match any device in the group until it
> is reattached.
> 
>     * There are probably stupid bugs, in particular I know some error
> paths are broken.  By all means point these out to expedite fixing
> them.
>     * The isolation_group structure is more or less a stub at this
> point.  It will probably want a pointer to an iommu_ops and/or its own
> ops structure.  The 'detached' bool may need to become some sort of
> state variable.  We'll probably want some sort of "owner" field for
> what (e.g. vfio) is using the group when its detached.

This makes me nervous.  One of the flaws I'm trying to correct with vfio
vs existing kvm PCI device assignment is the driver ownership model.
KVM will currently try to make use of any device it can and is only
stopped by drivers having already claimed various resources.  VFIO has a
strong device-driver-group ownership model.

Isolation groups facilitate ejecting the normal device drivers (which
actually seems somewhat dangerous) but isn't yet introducing any clear
ownership model to replace *a* device being bound to *a* driver.
Devices seem to be in limbo once detached.  We're not even outlining
here how we're going to attach group-drivers to groups or facilitate
matching individual devices to bus specific drivers, which seems like
it'll get us back into the individual device manipulation you complain
about above.

>     * I've also considered adding an "isolation strength" bitmask to
> indicate how well isolated the group is - e.g DMA isolated (iommu),
> irq isolated (irq remapping iommu, or other mechanism), error isolated
> (bus errors from this device can't affect other devices), ..

This sounds like a maintenance burden, which is why I've been trying to
push that iommu drivers should not expose groups that aren't fully
isolated.  It's easy to add parameters per iommu driver where the user
can opt-in to certain deficiencies.

> Anyway, on with the code..
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> 
> ---
>  drivers/Kconfig                      |    2 +
>  drivers/Makefile                     |    2 +
>  drivers/base/base.h                  |    5 +
>  drivers/base/core.c                  |    7 +
>  drivers/base/init.c                  |    2 +
>  drivers/isolation/Kconfig            |    3 +
>  drivers/isolation/Makefile           |    2 +
>  drivers/isolation/device_isolation.c |  216 ++++++++++++++++++++++++++++++++++
>  include/linux/device.h               |    5 +
>  include/linux/device_isolation.h     |   79 ++++++++++++
>  10 files changed, 323 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/isolation/Kconfig
>  create mode 100644 drivers/isolation/Makefile
>  create mode 100644 drivers/isolation/device_isolation.c
>  create mode 100644 include/linux/device_isolation.h
> 
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 95b9e7e..2ce5717 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
>  
>  source "drivers/virt/Kconfig"
>  
> +source "drivers/isolation/Kconfig"
> +
>  endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 7fa433a..741874a 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT)	+= iommu/
>  
>  # Virtualization drivers
>  obj-$(CONFIG_VIRT_DRIVERS)	+= virt/
> +
> +obj-$(CONFIG_DEVICE_ISOLATION)	+= isolation/
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index a34dca0..66a4ed2 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -24,6 +24,9 @@
>   * bus_type/class to be statically allocated safely.  Nothing outside of the
>   * driver core should ever touch these fields.
>   */
> +
> +#include <linux/device_isolation.h>
> +
>  struct subsys_private {
>  	struct kset subsys;
>  	struct kset *devices_kset;
> @@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev);
>  static inline int driver_match_device(struct device_driver *drv,
>  				      struct device *dev)
>  {
> +	if (isolation_device_is_detached(dev))
> +		return 0;
>  	return drv->bus->match ? drv->bus->match(dev, drv) : 1;
>  }
>  
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index bc8729d..45201c5 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -22,6 +22,7 @@
>  #include <linux/kallsyms.h>
>  #include <linux/mutex.h>
>  #include <linux/async.h>
> +#include <linux/device_isolation.h>
>  
>  #include "base.h"
>  #include "power/power.h"
> @@ -593,6 +594,10 @@ void device_initialize(struct device *dev)
>  	lockdep_set_novalidate_class(&dev->mutex);
>  	spin_lock_init(&dev->devres_lock);
>  	INIT_LIST_HEAD(&dev->devres_head);
> +#ifdef CONFIG_DEVICE_ISOLATION
> +	INIT_LIST_HEAD(&dev->isolation_list);
> +	dev->isolation_group = NULL;
> +#endif
>  	device_pm_init(dev);
>  	set_dev_node(dev, -1);
>  }
> @@ -993,6 +998,8 @@ int device_add(struct device *dev)
>  				class_intf->add_dev(dev, class_intf);
>  		mutex_unlock(&dev->class->p->class_mutex);
>  	}

At this point we've already attached a native driver to a potentially
in-use isolation group, so we haven't really addressed the hotplug
issue.  If we're creating groups and adding devices to them via the
device add notifier, that looks pretty similar to how vfio does it, and
there we only need the trivial iommu_device_group hook.  I'll give you
that the "detached" hook in the match function makes it easy to take a
device out of the driver model, but then what do you replace the driver
model with?

> +
> +	isolation_group_update_links(dev);
>  done:
>  	put_device(dev);
>  	return error;

We'd need something in the device_del path to block removal if the
device is in use via the group.  The equivalent of the .remove for the
driver it can no longer attach to.

> diff --git a/drivers/base/init.c b/drivers/base/init.c
> index c8a934e..89f3e94 100644
> --- a/drivers/base/init.c
> +++ b/drivers/base/init.c
> @@ -8,6 +8,7 @@
>  #include <linux/device.h>
>  #include <linux/init.h>
>  #include <linux/memory.h>
> +#include <linux/device_isolation.h>
>  
>  #include "base.h"
>  
> @@ -24,6 +25,7 @@ void __init driver_init(void)
>  	devices_init();
>  	buses_init();
>  	classes_init();
> +	isolation_init();
>  	firmware_init();
>  	hypervisor_init();
>  
> diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig
> new file mode 100644
> index 0000000..a2a7703
> --- /dev/null
> +++ b/drivers/isolation/Kconfig
> @@ -0,0 +1,3 @@
> +config DEVICE_ISOLATION
> +	bool "Enable isolating devices for safe pass-through to guests or user space."
> +
> diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile
> new file mode 100644
> index 0000000..8f1c6ec
> --- /dev/null
> +++ b/drivers/isolation/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o
> +
> diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c
> new file mode 100644
> index 0000000..2b00ff7
> --- /dev/null
> +++ b/drivers/isolation/device_isolation.c
> @@ -0,0 +1,216 @@
> +#include <linux/slab.h>
> +#include <linux/device_isolation.h>
> +
> +#define KERN_DEBUG KERN_WARNING
> +
> +
> +static struct kset *isolation_kset;
> +
> +#define to_isolation_attr(_attr) container_of(_attr, \
> +		struct isolation_attribute, attr)
> +
> +static ssize_t isolation_attr_show(struct kobject *kobj,
> +		struct attribute *attr, char *buf)
> +{
> +	struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
> +	struct isolation_group *group = container_of(kobj,
> +			struct isolation_group, kobj);
> +	ssize_t ret = -EIO;
> +
> +	if (isolation_attr->show)
> +		ret = isolation_attr->show(group, buf);
> +	return ret;
> +}
> +
> +static ssize_t isolation_attr_store(struct kobject *kobj,
> +		struct attribute *attr, const char *buf, size_t count)
> +{
> +	struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
> +	struct isolation_group *group = container_of(kobj,
> +			struct isolation_group, kobj);
> +	ssize_t ret = -EIO;
> +
> +	if (isolation_attr->store)
> +		ret = isolation_attr->store(group, buf, count);
> +	return ret;
> +}
> +
> +static void isolation_release(struct kobject *kobj)
> +{
> +	struct isolation_group *group = container_of(kobj,
> +			struct isolation_group, kobj);
> +
> +	pr_debug("isolation : release.\n");
> +
> +	kfree(group);
> +}
> +
> +static const struct sysfs_ops isolation_sysfs_ops = {
> +	.show	= isolation_attr_show,
> +	.store	= isolation_attr_store,
> +};
> +
> +static struct kobj_type isolation_ktype = {
> +	.sysfs_ops	= &isolation_sysfs_ops,
> +	.release	= isolation_release,
> +};
> +
> +static ssize_t group_showdetached(struct isolation_group *group,
> +		     char *buf)
> +{
> +	return sprintf(buf, "%u", !!group->detached);
> +}
> +
> +static ssize_t group_setdetached(struct isolation_group *group,
> +		     const char *buf, size_t count)
> +{
> +	switch (buf[0]) {
> +	case '0':
> +		isolation_group_reattach(group);
> +		break;
> +	case '1':
> +		isolation_group_detach(group);
> +		break;
> +	default:
> +		count = -EINVAL;
> +	}
> +	return count;
> +}
> +
> +static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR,
> +		group_showdetached, group_setdetached);
> +
> +struct isolation_group *isolation_group_new(const char *name)
> +{
> +	int ret;
> +	struct isolation_group *group;
> +
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	if (!group)
> +		return NULL;
> +
> +	spin_lock_init(&group->lock);
> +
> +	group->kobj.kset = isolation_kset;
> +	ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name);
> +	if (ret < 0) {
> +		printk(KERN_ERR "device_isolation: kobject_init_and_add failed "
> +				"for %s\n", group->kobj.name);
> +		return NULL;
> +	}
> +
> +	ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr);
> +	if (0 > ret)
> +		printk(KERN_WARNING "device_isolation: create \"detached\" failed "
> +			"for %s, errno=%i\n", group->kobj.name, ret);
> +
> +	INIT_LIST_HEAD(&group->devices);
> +	printk(KERN_DEBUG "device_isolation: group %s created\n",
> +		group->kobj.name);
> +
> +	return group;
> +}
> +
> +void isolation_group_dev_add(struct isolation_group *group, struct device *dev)
> +{
> +	printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
> +		dev->kobj.name, group->kobj.name);
> +	spin_lock(&group->lock);
> +
> +	list_add_tail(&dev->isolation_list, &group->devices);
> +	dev->isolation_group = group;
> +	/* Todo: remove links we already created */
> +	spin_unlock(&group->lock);
> +}
> +
> +void isolation_group_dev_remove(struct device *dev)
> +{
> +	list_del(&dev->isolation_list);
> +}
> +
> +int isolation_group_update_links(struct device *dev)
> +{
> +	int error;
> +	char *buf;
> +	struct isolation_group *group = dev->isolation_group;
> +
> +	if (!group) {
> +		printk(KERN_DEBUG "device_isolation: no group for %s\n",
> +				dev->kobj.name);
> +		return 0;
> +	}
> +
> +	printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
> +		dev->kobj.name, group->kobj.name);
> +
> +	spin_lock(&group->lock);
> +
> +	error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group");
> +	if (0 > error)
> +		printk(KERN_WARNING "device_isolation: create isolation_group "
> +			"link failed for %s -> %s, errno=%i\n",
> +			dev->kobj.name, group->kobj.name, error);
> +
> +	buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev);
> +	if (!buf) {
> +		error = -ENOMEM;
> +		printk(KERN_ERR "device_isolation: kasprintf failed\n");
> +		goto out;
> +	}
> +
> +	error = sysfs_create_link(&group->kobj, &dev->kobj, buf);
> +	kfree(buf);
> +	if (0 > error)
> +		printk(KERN_WARNING "device_isolation: create %s "
> +			"link failed for %s -> %s, errno=%i\n",
> +			buf, dev->kobj.name, group->kobj.name, error);
> +
> +out:
> +	/* Todo: remove links we already created */
> +	spin_unlock(&group->lock);
> +	return error;
> +}
> +
> +int isolation_group_detach(struct isolation_group *group)
> +{
> +	int error;
> +	struct device *dev;
> +
> +	spin_lock(&group->lock);
> +
> +	group->detached = true;
> +	list_for_each_entry(dev, &group->devices, isolation_list) {
> +		error = device_reprobe(dev);

How do we handle blocking .remove functions?  That's one reason vfio
doesn't attempt to automatically detach drivers.  We can treat the
groups as a unit here, but it's hard to get away from the fact that
potentially several individual devices are being removed from their
drivers here.

> +	}
> +
> +	spin_unlock(&group->lock);
> +
> +	return 0;
> +}
> +
> +int isolation_group_reattach(struct isolation_group *group)
> +{
> +	int error;
> +	struct device *dev;
> +
> +	spin_lock(&group->lock);

Obviously some ownership check would have to exist here if we knew how
to claim a group.

> +	group->detached = false;
> +	list_for_each_entry(dev, &group->devices, isolation_list) {
> +		printk("Reprobing %s\n", dev->kobj.name);
> +		error = device_reprobe(dev);
> +	}
> +
> +	spin_unlock(&group->lock);
> +
> +	return 0;
> +}
> +
> +int __init isolation_init(void)
> +{
> +	isolation_kset = kset_create_and_add("isolation", NULL, NULL);
> +	if (!isolation_kset)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> diff --git a/include/linux/device.h b/include/linux/device.h
> index c20dfbf..6238a13 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -585,6 +585,11 @@ struct device {
>  
>  	struct dma_coherent_mem	*dma_mem; /* internal for coherent mem
>  					     override */
> +#ifdef CONFIG_DEVICE_ISOLATION
> +	struct isolation_group  *isolation_group;
> +	struct list_head 	isolation_list;
> +#endif
> +
>  	/* arch specific additions */
>  	struct dev_archdata	archdata;
>  
> diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h
> new file mode 100644
> index 0000000..1dc7e68
> --- /dev/null
> +++ b/include/linux/device_isolation.h
> @@ -0,0 +1,79 @@
> +#ifndef _DEVICE_ISOLATION_H_
> +#define _DEVICE_ISOLATION_H_
> +
> +#include <linux/kobject.h>
> +#include <linux/list.h>
> +#include <linux/device.h>
> +
> +struct isolation_group {
> +	struct kobject 		kobj;
> +	struct list_head 	devices;
> +	spinlock_t		lock;
> +	bool			detached;
> +};
> +
> +struct isolation_attribute {
> +	struct attribute	attr;
> +	ssize_t (*show)(struct isolation_group *group, char *buf);
> +	ssize_t (*store)(struct isolation_group *group, const char *buf,
> +			size_t count);
> +};
> +
> +#ifdef CONFIG_DEVICE_ISOLATION
> +
> +int __init isolation_init(void);
> +
> +struct isolation_group *isolation_group_new(const char *name);
> +int isolation_group_delete(struct isolation_group *group);
> +
> +void isolation_group_dev_add(struct isolation_group *group,
> +		struct device *dev);
> +void isolation_group_dev_remove(struct device *dev);
> +
> +int isolation_group_update_links(struct device *dev);
> +int isolation_group_detach(struct isolation_group *group);
> +int isolation_group_reattach(struct isolation_group *group);
> +
> +#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store)	\
> +	struct isolation_attribute isolation_attr_##_name =	\
> +		__ATTR(_name, _mode, _show, _store)
> +
> +static inline int isolation_device_is_detached(struct device *dev)
> +{
> +	return dev->isolation_group && dev->isolation_group->detached;
> +}
> +
> +#else
> +
> +static inline int __init isolation_init(void)
> +{
> +	return 0;
> +}
> +
> +static inline struct isolation_group *isolation_group_new(const char *name)
> +{
> +	return NULL;
> +}
> +
> +/*void isolation_group_dev_add(struct isolation_group *group,
> +		struct device *dev)
> +{
> +}*/
> +
> +static inline void isolation_group_dev_remove(struct device *dev)
> +{
> +}
> +
> +static inline int isolation_group_update_links(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline int isolation_is_detached(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +#endif
> +
> +#endif

So what this boils down to so far are some group objects with lists of
devices that allow us to pop off all the native drivers and prevent them
from reattaching.  And for that convenience, we're going to need a whole
new group driver model to allow something like vfio to claim the group
and then maybe some group-device model to allow bus specific drivers to
easily make use of the devices within the group?  And somehow along the
way we're going to avoid the group vs device awkwardness?  And again,
all for an interface that the vast majority of users won't use.  I'm not
convinced.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07  6:46 ` Alex Williamson
@ 2011-12-07 14:22   ` David Gibson
  2011-12-07 19:45     ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2011-12-07 14:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote:
> On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
> > 
> > Alex Williamson recently posted some patches adding a new hook to
> > iommu_ops to identify groups of devices which cannot reliably be
> > distinguished by the iommu, and therefore can only safely be assigned
> > as a unit to a guest or userspace driver.
> > 
> > I'm not very fond of that design, because:
> >     - It requires something else to scan the group ids to present the
> > groups in a userspace discoverable manner, rather than handling that
> > in the core.
> 
> The thought was that the vast majority of users won't make use of
> groups, so why bloat the core when the only user is vfio.  

Well, "won't make use of" actually means "are on a system with
all-singleton groups".  And the answer is because it actually doesn't
bloat, but simplifies the core.  We need the group concept _at
minimum_ to handle bridges and buggy multifunction devices.  So it
makes life easier to assume they're always there, even if they're
often singletons.

> We need bus
> specific drivers to expose devices in vfio, so it's fairly natural to
> have the bus drivers enumerate devices on their bus, register them with
> the vfio-core, which then exposes groups in a convenient manner for
> userspace.
> 
> >     - It encorages manipulating devices individually, but only
> > permitting certain actions if the whole group is together.  This seems
> > confusing and arse-backwards to me.
> 
> Hmm, this patch introduces a group level detach and reattach of devices,
> but seems to lack any support for actually making use of devices in the
> detached state, so it's rather hard to compare.

well, yes, that's the next layer up.  Working on it..

>  How does someone
> wanting to interact with a device within the group do so?  This appears
> to put devices entirely outside of the driver model when detached, which
> seems just as awkward.
> 
> >     - It does not provide an obvious means to prevent a kernel
> > driver unsafely binding to a new device which is hotplugged into an
> > iommu group already assigned to a guest.
> 
> I've added this for PCI, triggered via the device add notifier.

Hrm.  Does the notifier run early enough to guarantee that a driver
can't bind before it's complete?

> >     - It is inherently centred arounf the iommu structure, however
> > there might be other platform dependent factors which also affect
> > which devices can be safely isolated from others.
> 
> Yep, MSI isolation being an example.  I tend to think the iommu driver
> is pretty closely tied to the platform (all current examples are), so
> the iommu driver should take those factors into account when advertising
> groups.
> 
> > These could probably each be addressed with various tweaks, but
> > basically, I'm just not convinced that it represents the right
> > structure for handling this.  So, Alexey Kardashevskiy and myself havr
> > made this patches, representing the outline of how I think this should
> > be handled.
> > 
> > 
> > Notes:
> >     * All names are negotiable.  At the moment I'm thinking of this as
> > the "device isolation subsystem" which handles "device isolation
> > groups" (not to be confused with an iommu domain, which could contain
> > one or more groups).  A group is considered "detached" when it has
> > been removed from the normal driver binding process, and therefore
> > ready for assignment to a guest or userspace.  I'm more than open to
> > suggestions of better names.
> >     * This is just the generic infrastructure.  Obviously, platform
> > specific code will be needed as well to create the groups and assign
> > devices to them.  We have a proof of concept implementation for the
> > power p5ioc2 pci host bridge, and I hope to post code for that and the
> > p7ioc bridge shortly.  I'll also be looking at how this would be used
> > by Intel VTd, and AMD iommus.
> >     * isolation groups appear in /sys/isolation.  Isolatable devices
> > have a link to their group.  The group contains links to all their
> > devices, and a "detached" attribute, which can be used to view and
> > control whether the group is detached.
> >     * At present groups are detached manually via sysfs, but the idea
> > is a kernel system like vfio, could also directly request a group to
> > be detached for its use.
> >     * when detached, all kernel drivers are unbound from every device
> > in the group.  No driver will match any device in the group until it
> > is reattached.
> > 
> >     * There are probably stupid bugs, in particular I know some error
> > paths are broken.  By all means point these out to expedite fixing
> > them.
> >     * The isolation_group structure is more or less a stub at this
> > point.  It will probably want a pointer to an iommu_ops and/or its own
> > ops structure.  The 'detached' bool may need to become some sort of
> > state variable.  We'll probably want some sort of "owner" field for
> > what (e.g. vfio) is using the group when its detached.
> 
> This makes me nervous.  One of the flaws I'm trying to correct with vfio
> vs existing kvm PCI device assignment is the driver ownership model.
> KVM will currently try to make use of any device it can and is only
> stopped by drivers having already claimed various resources.  VFIO has a
> strong device-driver-group ownership model.

Right... I'm not sure what aspect makes you nervous.  Exactly what I'm
saying here is that we will need an ownership model for detached groups.

> Isolation groups facilitate ejecting the normal device drivers (which
> actually seems somewhat dangerous)

Why dangerous?  You have to do exactly the same thing in more steps in
order to use a group with vfio.

> but isn't yet introducing any clear
> ownership model to replace *a* device being bound to *a* driver.
> Devices seem to be in limbo once detached.  We're not even outlining
> here how we're going to attach group-drivers to groups or facilitate
> matching individual devices to bus specific drivers, which seems like
> it'll get us back into the individual device manipulation you complain
> about above.

I'm envisaging something like
	struct isolation_group_owner *owner;

in the struct i_g, which is NULL when the group is attached.  It would
be a parameter to the detach call, so a subsystem (like vfio) could
detach and claim ownership atomically.  The struct i_g_o would have
some generic fields and some ops, and the owner code could
container_of it to get private, more detailed info
e.g. uid/gid/pid/mm* or whatever for vfio.

> >     * I've also considered adding an "isolation strength" bitmask to
> > indicate how well isolated the group is - e.g DMA isolated (iommu),
> > irq isolated (irq remapping iommu, or other mechanism), error isolated
> > (bus errors from this device can't affect other devices), ..
> 
> This sounds like a maintenance burden, which is why I've been trying to
> push that iommu drivers should not expose groups that aren't fully
> isolated.  It's easy to add parameters per iommu driver where the user
> can opt-in to certain deficiencies.

True.  In fact the only real way to use such a bitmask, I think, would
be to have it as a global, defining the minimum isolation
characteristics to the iommu drivers.  That might help to unify
configuration of what flaws in isolation are acceptable.

> > Anyway, on with the code..
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > ---
> >  drivers/Kconfig                      |    2 +
> >  drivers/Makefile                     |    2 +
> >  drivers/base/base.h                  |    5 +
> >  drivers/base/core.c                  |    7 +
> >  drivers/base/init.c                  |    2 +
> >  drivers/isolation/Kconfig            |    3 +
> >  drivers/isolation/Makefile           |    2 +
> >  drivers/isolation/device_isolation.c |  216 ++++++++++++++++++++++++++++++++++
> >  include/linux/device.h               |    5 +
> >  include/linux/device_isolation.h     |   79 ++++++++++++
> >  10 files changed, 323 insertions(+), 0 deletions(-)
> >  create mode 100644 drivers/isolation/Kconfig
> >  create mode 100644 drivers/isolation/Makefile
> >  create mode 100644 drivers/isolation/device_isolation.c
> >  create mode 100644 include/linux/device_isolation.h
> > 
> > diff --git a/drivers/Kconfig b/drivers/Kconfig
> > index 95b9e7e..2ce5717 100644
> > --- a/drivers/Kconfig
> > +++ b/drivers/Kconfig
> > @@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
> >  
> >  source "drivers/virt/Kconfig"
> >  
> > +source "drivers/isolation/Kconfig"
> > +
> >  endmenu
> > diff --git a/drivers/Makefile b/drivers/Makefile
> > index 7fa433a..741874a 100644
> > --- a/drivers/Makefile
> > +++ b/drivers/Makefile
> > @@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT)	+= iommu/
> >  
> >  # Virtualization drivers
> >  obj-$(CONFIG_VIRT_DRIVERS)	+= virt/
> > +
> > +obj-$(CONFIG_DEVICE_ISOLATION)	+= isolation/
> > diff --git a/drivers/base/base.h b/drivers/base/base.h
> > index a34dca0..66a4ed2 100644
> > --- a/drivers/base/base.h
> > +++ b/drivers/base/base.h
> > @@ -24,6 +24,9 @@
> >   * bus_type/class to be statically allocated safely.  Nothing outside of the
> >   * driver core should ever touch these fields.
> >   */
> > +
> > +#include <linux/device_isolation.h>
> > +
> >  struct subsys_private {
> >  	struct kset subsys;
> >  	struct kset *devices_kset;
> > @@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev);
> >  static inline int driver_match_device(struct device_driver *drv,
> >  				      struct device *dev)
> >  {
> > +	if (isolation_device_is_detached(dev))
> > +		return 0;
> >  	return drv->bus->match ? drv->bus->match(dev, drv) : 1;
> >  }
> >  
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index bc8729d..45201c5 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/kallsyms.h>
> >  #include <linux/mutex.h>
> >  #include <linux/async.h>
> > +#include <linux/device_isolation.h>
> >  
> >  #include "base.h"
> >  #include "power/power.h"
> > @@ -593,6 +594,10 @@ void device_initialize(struct device *dev)
> >  	lockdep_set_novalidate_class(&dev->mutex);
> >  	spin_lock_init(&dev->devres_lock);
> >  	INIT_LIST_HEAD(&dev->devres_head);
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +	INIT_LIST_HEAD(&dev->isolation_list);
> > +	dev->isolation_group = NULL;
> > +#endif
> >  	device_pm_init(dev);
> >  	set_dev_node(dev, -1);
> >  }
> > @@ -993,6 +998,8 @@ int device_add(struct device *dev)
> >  				class_intf->add_dev(dev, class_intf);
> >  		mutex_unlock(&dev->class->p->class_mutex);
> >  	}
> 
> At this point we've already attached a native driver to a potentially
> in-use isolation group, so we haven't really addressed the hotplug
> issue.  If we're creating groups and adding devices to them via the
> device add notifier, that looks pretty similar to how vfio does it, and
> there we only need the trivial iommu_device_group hook.

Uh, no.  This probably isn't clear without an example iommu
implementation, but the idea is that group_dev_add is called before
device_add, so group membership is established before the device
appears in sysfs, and before a driver will bind.

That works nicely for our iommu structure (built into the host bridge,
essentially).  It's a bit nastier on x86, because the iommus are
initialized relatively late.  I'm still getting my head around that
code, but I think it will work, although it might need slightly
different handling for the boot-time and hotplug cases.

>  I'll give you
> that the "detached" hook in the match function makes it easy to take a
> device out of the driver model, but then what do you replace the driver
> model with?

Other, vfio-like, subsystems explicitly claiming groups for
themselves.

> > +
> > +	isolation_group_update_links(dev);
> >  done:
> >  	put_device(dev);
> >  	return error;
> 
> We'd need something in the device_del path to block removal if the
> device is in use via the group.  The equivalent of the .remove for the
> driver it can no longer attach to.

Good point.  At least for the detach-via-sysfs case (which may not
even be a good idea to keep).  In general, I'd expect reattach() to be
called by the subsystem which detached the group in the first place,
when it's done with it.

> > diff --git a/drivers/base/init.c b/drivers/base/init.c
> > index c8a934e..89f3e94 100644
> > --- a/drivers/base/init.c
> > +++ b/drivers/base/init.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/device.h>
> >  #include <linux/init.h>
> >  #include <linux/memory.h>
> > +#include <linux/device_isolation.h>
> >  
> >  #include "base.h"
> >  
> > @@ -24,6 +25,7 @@ void __init driver_init(void)
> >  	devices_init();
> >  	buses_init();
> >  	classes_init();
> > +	isolation_init();
> >  	firmware_init();
> >  	hypervisor_init();
> >  
> > diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig
> > new file mode 100644
> > index 0000000..a2a7703
> > --- /dev/null
> > +++ b/drivers/isolation/Kconfig
> > @@ -0,0 +1,3 @@
> > +config DEVICE_ISOLATION
> > +	bool "Enable isolating devices for safe pass-through to guests or user space."
> > +
> > diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile
> > new file mode 100644
> > index 0000000..8f1c6ec
> > --- /dev/null
> > +++ b/drivers/isolation/Makefile
> > @@ -0,0 +1,2 @@
> > +obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o
> > +
> > diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c
> > new file mode 100644
> > index 0000000..2b00ff7
> > --- /dev/null
> > +++ b/drivers/isolation/device_isolation.c
> > @@ -0,0 +1,216 @@
> > +#include <linux/slab.h>
> > +#include <linux/device_isolation.h>
> > +
> > +#define KERN_DEBUG KERN_WARNING
> > +
> > +
> > +static struct kset *isolation_kset;
> > +
> > +#define to_isolation_attr(_attr) container_of(_attr, \
> > +		struct isolation_attribute, attr)
> > +
> > +static ssize_t isolation_attr_show(struct kobject *kobj,
> > +		struct attribute *attr, char *buf)
> > +{
> > +	struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
> > +	struct isolation_group *group = container_of(kobj,
> > +			struct isolation_group, kobj);
> > +	ssize_t ret = -EIO;
> > +
> > +	if (isolation_attr->show)
> > +		ret = isolation_attr->show(group, buf);
> > +	return ret;
> > +}
> > +
> > +static ssize_t isolation_attr_store(struct kobject *kobj,
> > +		struct attribute *attr, const char *buf, size_t count)
> > +{
> > +	struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
> > +	struct isolation_group *group = container_of(kobj,
> > +			struct isolation_group, kobj);
> > +	ssize_t ret = -EIO;
> > +
> > +	if (isolation_attr->store)
> > +		ret = isolation_attr->store(group, buf, count);
> > +	return ret;
> > +}
> > +
> > +static void isolation_release(struct kobject *kobj)
> > +{
> > +	struct isolation_group *group = container_of(kobj,
> > +			struct isolation_group, kobj);
> > +
> > +	pr_debug("isolation : release.\n");
> > +
> > +	kfree(group);
> > +}
> > +
> > +static const struct sysfs_ops isolation_sysfs_ops = {
> > +	.show	= isolation_attr_show,
> > +	.store	= isolation_attr_store,
> > +};
> > +
> > +static struct kobj_type isolation_ktype = {
> > +	.sysfs_ops	= &isolation_sysfs_ops,
> > +	.release	= isolation_release,
> > +};
> > +
> > +static ssize_t group_showdetached(struct isolation_group *group,
> > +		     char *buf)
> > +{
> > +	return sprintf(buf, "%u", !!group->detached);
> > +}
> > +
> > +static ssize_t group_setdetached(struct isolation_group *group,
> > +		     const char *buf, size_t count)
> > +{
> > +	switch (buf[0]) {
> > +	case '0':
> > +		isolation_group_reattach(group);
> > +		break;
> > +	case '1':
> > +		isolation_group_detach(group);
> > +		break;
> > +	default:
> > +		count = -EINVAL;
> > +	}
> > +	return count;
> > +}
> > +
> > +static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR,
> > +		group_showdetached, group_setdetached);
> > +
> > +struct isolation_group *isolation_group_new(const char *name)
> > +{
> > +	int ret;
> > +	struct isolation_group *group;
> > +
> > +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +	if (!group)
> > +		return NULL;
> > +
> > +	spin_lock_init(&group->lock);
> > +
> > +	group->kobj.kset = isolation_kset;
> > +	ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name);
> > +	if (ret < 0) {
> > +		printk(KERN_ERR "device_isolation: kobject_init_and_add failed "
> > +				"for %s\n", group->kobj.name);
> > +		return NULL;
> > +	}
> > +
> > +	ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr);
> > +	if (0 > ret)
> > +		printk(KERN_WARNING "device_isolation: create \"detached\" failed "
> > +			"for %s, errno=%i\n", group->kobj.name, ret);
> > +
> > +	INIT_LIST_HEAD(&group->devices);
> > +	printk(KERN_DEBUG "device_isolation: group %s created\n",
> > +		group->kobj.name);
> > +
> > +	return group;
> > +}
> > +
> > +void isolation_group_dev_add(struct isolation_group *group, struct device *dev)
> > +{
> > +	printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
> > +		dev->kobj.name, group->kobj.name);
> > +	spin_lock(&group->lock);
> > +
> > +	list_add_tail(&dev->isolation_list, &group->devices);
> > +	dev->isolation_group = group;
> > +	/* Todo: remove links we already created */
> > +	spin_unlock(&group->lock);
> > +}
> > +
> > +void isolation_group_dev_remove(struct device *dev)
> > +{
> > +	list_del(&dev->isolation_list);
> > +}
> > +
> > +int isolation_group_update_links(struct device *dev)
> > +{
> > +	int error;
> > +	char *buf;
> > +	struct isolation_group *group = dev->isolation_group;
> > +
> > +	if (!group) {
> > +		printk(KERN_DEBUG "device_isolation: no group for %s\n",
> > +				dev->kobj.name);
> > +		return 0;
> > +	}
> > +
> > +	printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
> > +		dev->kobj.name, group->kobj.name);
> > +
> > +	spin_lock(&group->lock);
> > +
> > +	error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group");
> > +	if (0 > error)
> > +		printk(KERN_WARNING "device_isolation: create isolation_group "
> > +			"link failed for %s -> %s, errno=%i\n",
> > +			dev->kobj.name, group->kobj.name, error);
> > +
> > +	buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev);
> > +	if (!buf) {
> > +		error = -ENOMEM;
> > +		printk(KERN_ERR "device_isolation: kasprintf failed\n");
> > +		goto out;
> > +	}
> > +
> > +	error = sysfs_create_link(&group->kobj, &dev->kobj, buf);
> > +	kfree(buf);
> > +	if (0 > error)
> > +		printk(KERN_WARNING "device_isolation: create %s "
> > +			"link failed for %s -> %s, errno=%i\n",
> > +			buf, dev->kobj.name, group->kobj.name, error);
> > +
> > +out:
> > +	/* Todo: remove links we already created */
> > +	spin_unlock(&group->lock);
> > +	return error;
> > +}
> > +
> > +int isolation_group_detach(struct isolation_group *group)
> > +{
> > +	int error;
> > +	struct device *dev;
> > +
> > +	spin_lock(&group->lock);
> > +
> > +	group->detached = true;
> > +	list_for_each_entry(dev, &group->devices, isolation_list) {
> > +		error = device_reprobe(dev);
> 
> How do we handle blocking .remove functions?  That's one reason vfio
> doesn't attempt to automatically detach drivers.  We can treat the
> groups as a unit here, but it's hard to get away from the fact that
> potentially several individual devices are being removed from their
> drivers here.

Slowly.  We have to remove all the drivers, so we have to just wait
for them.  I guess we should have a timeout and reattach the group if
we can't unbind the drivers in that time.  And/or figure a way to make
it interruptible, so you can ctrl-c something midway through
triggering a detach, and it will re-attach.

Oh, right, and that means we can't take the lock around it all.  Well
that's fixable with the right rechecks of the list.  we might need to
replace the bool with some sort of state variable; ATTACHED -> DETACHING
-> DETACHED.

> > +	}
> > +
> > +	spin_unlock(&group->lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int isolation_group_reattach(struct isolation_group *group)
> > +{
> > +	int error;
> > +	struct device *dev;
> > +
> > +	spin_lock(&group->lock);
> 
> Obviously some ownership check would have to exist here if we knew how
> to claim a group.
> 
> > +	group->detached = false;
> > +	list_for_each_entry(dev, &group->devices, isolation_list) {
> > +		printk("Reprobing %s\n", dev->kobj.name);
> > +		error = device_reprobe(dev);
> > +	}
> > +
> > +	spin_unlock(&group->lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int __init isolation_init(void)
> > +{
> > +	isolation_kset = kset_create_and_add("isolation", NULL, NULL);
> > +	if (!isolation_kset)
> > +		return -ENOMEM;
> > +	return 0;
> > +}
> > +
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index c20dfbf..6238a13 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -585,6 +585,11 @@ struct device {
> >  
> >  	struct dma_coherent_mem	*dma_mem; /* internal for coherent mem
> >  					     override */
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +	struct isolation_group  *isolation_group;
> > +	struct list_head 	isolation_list;
> > +#endif
> > +
> >  	/* arch specific additions */
> >  	struct dev_archdata	archdata;
> >  
> > diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h
> > new file mode 100644
> > index 0000000..1dc7e68
> > --- /dev/null
> > +++ b/include/linux/device_isolation.h
> > @@ -0,0 +1,79 @@
> > +#ifndef _DEVICE_ISOLATION_H_
> > +#define _DEVICE_ISOLATION_H_
> > +
> > +#include <linux/kobject.h>
> > +#include <linux/list.h>
> > +#include <linux/device.h>
> > +
> > +struct isolation_group {
> > +	struct kobject 		kobj;
> > +	struct list_head 	devices;
> > +	spinlock_t		lock;
> > +	bool			detached;
> > +};
> > +
> > +struct isolation_attribute {
> > +	struct attribute	attr;
> > +	ssize_t (*show)(struct isolation_group *group, char *buf);
> > +	ssize_t (*store)(struct isolation_group *group, const char *buf,
> > +			size_t count);
> > +};
> > +
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +
> > +int __init isolation_init(void);
> > +
> > +struct isolation_group *isolation_group_new(const char *name);
> > +int isolation_group_delete(struct isolation_group *group);
> > +
> > +void isolation_group_dev_add(struct isolation_group *group,
> > +		struct device *dev);
> > +void isolation_group_dev_remove(struct device *dev);
> > +
> > +int isolation_group_update_links(struct device *dev);
> > +int isolation_group_detach(struct isolation_group *group);
> > +int isolation_group_reattach(struct isolation_group *group);
> > +
> > +#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store)	\
> > +	struct isolation_attribute isolation_attr_##_name =	\
> > +		__ATTR(_name, _mode, _show, _store)
> > +
> > +static inline int isolation_device_is_detached(struct device *dev)
> > +{
> > +	return dev->isolation_group && dev->isolation_group->detached;
> > +}
> > +
> > +#else
> > +
> > +static inline int __init isolation_init(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline struct isolation_group *isolation_group_new(const char *name)
> > +{
> > +	return NULL;
> > +}
> > +
> > +/*void isolation_group_dev_add(struct isolation_group *group,
> > +		struct device *dev)
> > +{
> > +}*/
> > +
> > +static inline void isolation_group_dev_remove(struct device *dev)
> > +{
> > +}
> > +
> > +static inline int isolation_group_update_links(struct device *dev)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline int isolation_is_detached(struct device *dev)
> > +{
> > +	return 0;
> > +}
> > +
> > +#endif
> > +
> > +#endif
> 
> So what this boils down to so far are some group objects with lists of
> devices that allow us to pop off all the native drivers and prevent them
> from reattaching.  And for that convenience, we're going to need a whole
> new group driver model to allow something like vfio to claim the group

Well, we need the group model anyway.  Having just ids at the iommu
level merely pushes the work of tracking and exposing them up to
higher levels.

> and then maybe some group-device model to allow bus specific drivers to
> easily make use of the devices within the group?

Huh?

>  And somehow along the
> way we're going to avoid the group vs device awkwardness?

Groups are there whether we like it or not.  This is just about
exposing it cleanly.

>  And again,
> all for an interface that the vast majority of users won't use.  I'm not
> convinced.  Thanks,

You don't get a choice about whether you use groups.  If you have any
two devices (or functions) that can't be isolated from each other, you
have a group, and the kernel code (and the user code, for that matter)
has to deal with it.  So you have to be able to see the groups, or
things will just mysteriously not work when you discover you had two
not-mutually isolatable devices you hadn't realised you had.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07 14:22   ` David Gibson
@ 2011-12-07 19:45     ` Alex Williamson
  2011-12-07 23:41       ` Benjamin Herrenschmidt
  2011-12-08  2:43       ` David Gibson
  0 siblings, 2 replies; 16+ messages in thread
From: Alex Williamson @ 2011-12-07 19:45 UTC (permalink / raw)
  To: David Gibson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Thu, 2011-12-08 at 01:22 +1100, David Gibson wrote:
> On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote:
> > On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
> > > 
> > > Alex Williamson recently posted some patches adding a new hook to
> > > iommu_ops to identify groups of devices which cannot reliably be
> > > distinguished by the iommu, and therefore can only safely be assigned
> > > as a unit to a guest or userspace driver.
> > > 
> > > I'm not very fond of that design, because:
> > >     - It requires something else to scan the group ids to present the
> > > groups in a userspace discoverable manner, rather than handling that
> > > in the core.
> > 
> > The thought was that the vast majority of users won't make use of
> > groups, so why bloat the core when the only user is vfio.  
> 
> Well, "won't make use of" actually means "are on a system with
> all-singleton groups".  And the answer is because it actually doesn't
> bloat, but simplifies the core.  We need the group concept _at
> minimum_ to handle bridges and buggy multifunction devices.  So it
> makes life easier to assume they're always there, even if they're
> often singletons.

Ok, let's back up from the patch because I think its clouding the driver
model with a detached device state that might be detracting from the
useful bits.

An isolation group is a generic association of devices into a set based
on, well... isolation.  Presumably something like an iommu driver does a
bus_register_notifier() and bus_for_each_dev() and when a new device is
added we do isolation_group_new() and isolation_group_dev_add() as
necessary.  When the iommu driver needs to do a dma_ops call for a
device, it can traverse up to the isolation group and find something
that points it to the shared iommu context for that group.  While
dma_ops can share the iommu context among the drivers attached to the
various devices, iommu_ops wants a single meta-driver for the group
because it's predominantly designed for use when devices are exposed
directly to the user.  We trust native kernel drivers to share context
between each other, we do not trust user level drivers to share context
with anyone but themselves.  Ok so far?

The first question we run into is what does the iommu driver do if it
does not provide isolation?  If it's a translation-only iommu, maybe it
has a single context and can manage to maintain it internally and should
opt-out of isolation groups.  But we also have things like MSI isolation
where we have multiple contexts and the isolation strength is sufficient
for dma_ops, but not iommu_ops.  I don't know if we want to make that
determination via a strength bitmask at the group level or simply a bool
that the iommu driver sets when it registers a group.

Once we store iommu context at the group, it makes sense to start using
it as the primary object the iommu drivers operate on.  There's no need
to break everyone for dma_ops, so isolation aware iommu drivers can
simply follow the link from the device to the group with no change to
callers.  iommu_ops though can switch to being entirely group based [not
looking forward to retrofitting groups into kvm device assignment].

So the next problem is that while the group is the minimum granularity
for the iommu, it's not necessarily the desired granularity.  iommus
like VT-d have per PCI BDF context entries that can point to shared page
tables.  On such systems we also typically have singleton isolation
groups, so when multiple devices are used by a single user, we have a
lot of duplication in time and space.  VFIO handles this by allowing
groups to be "merged".  When this happens, the merged groups point to
the same iommu context.  I'm not sure what the plan is with isolation
groups, but we need some way to reduce that overhead.

Finally, we get to how do we manage the devices of the group when we
switch to iommu_ops mode vs dma_ops, which is what this patch seems to
have started with.  For VFIO we have bus specific drivers which attach
to individual devices.  When all the devices in the group are attached
to the bus driver (by the user), iommu_ops, and therefore access to
devices from userspace becomes available.

This is the level that the vast majority of users are not going to make
use of, so while the above does feel cleaner and justifies it's
addition, we need to be careful here.

My concern with the "detached state" approach is that we're removing
devices from the existing driver model.  Who does things like pm_runtime
callbacks when we start using devices?  How does a driver enumerate and
attach to devices within the group?  How do we even enumerate and attach
to isolation groups for the group level metadriver?  This is where I'm
nervous we're just making a requirement for a group-driver model that
lives alongside the existing device driver model.

I actually like the approach that individual devices are bound by the
user to vfio bus drivers (group/iommu_ops drivers).  This is tedious,
but easily automate-able in userspace and puts the responsibility of
unbinding each device on the user.  Imagine the chaos a user could cause
if vfio creates a chardev for accessing a group, we give a user access
to it, and each time it's opened all the group devices automatically get
unbound from the host drivers and re-bound on close.  'while (true); do
cat /dev/vfio/$GROUP; done' could cause a denial of service on the
system.

We just need to figure out how binding vfio bus drivers to all the
isolation group devices turns the robot lions into Voltron ;)

> > >     - It does not provide an obvious means to prevent a kernel
> > > driver unsafely binding to a new device which is hotplugged into an
> > > iommu group already assigned to a guest.
> > 
> > I've added this for PCI, triggered via the device add notifier.
> 
> Hrm.  Does the notifier run early enough to guarantee that a driver
> can't bind before it's complete?

Yes, like I assume you're planning to do with setting up groups, it runs
from the device add bus notifier.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07 19:45     ` Alex Williamson
@ 2011-12-07 23:41       ` Benjamin Herrenschmidt
  2011-12-08  0:16         ` Chris Wright
  2011-12-08  5:51         ` Alex Williamson
  2011-12-08  2:43       ` David Gibson
  1 sibling, 2 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2011-12-07 23:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel,
	chrisw, agraf, scottwood, B08248

On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote:

> Ok, let's back up from the patch because I think its clouding the driver
> model with a detached device state that might be detracting from the
> useful bits.
> 
> An isolation group is a generic association of devices into a set based
> on, well... isolation.  Presumably something like an iommu driver does a
> bus_register_notifier() and bus_for_each_dev() and when a new device is
> added we do isolation_group_new() and isolation_group_dev_add() as
> necessary.  When the iommu driver needs to do a dma_ops call for a
> device, it can traverse up to the isolation group and find something
> that points it to the shared iommu context for that group.  While
> dma_ops can share the iommu context among the drivers attached to the
> various devices, iommu_ops wants a single meta-driver for the group
> because it's predominantly designed for use when devices are exposed
> directly to the user.  We trust native kernel drivers to share context
> between each other, we do not trust user level drivers to share context
> with anyone but themselves.  Ok so far?
> 
> The first question we run into is what does the iommu driver do if it
> does not provide isolation?  If it's a translation-only iommu, maybe it
> has a single context and can manage to maintain it internally and should
> opt-out of isolation groups.  But we also have things like MSI isolation
> where we have multiple contexts and the isolation strength is sufficient
> for dma_ops, but not iommu_ops.  I don't know if we want to make that
> determination via a strength bitmask at the group level or simply a bool
> that the iommu driver sets when it registers a group.

 .../...

I think we are trying to solve too many problems at once.

I don't believe we should have the normal dma ops path in the kernel be
rewritten on top of some new iommu isolation layer or anything like that
at this stage.

Proceed step by step or we'll never be finished. What that means imho
is:

 - Existing dma/iommu arch layers mostly remain as-is. It has its own
internal representation of devices, domains and possibly grouping
information, and things like the buggy RICOH devices are handled there
(via a flag set in a generic header quirk)

 - Isolation groups are a new high-level thing that is not used by the
iommu/dma code for normal in-kernel operations at this stage. The
platform iommu/dma code creates the isolation group objects and
populates them, it doesn't initially -use- them.

 - We can then use that for the needs of exposing the groups to user
space, perform driver level isolation, and provide the grouping
information to vfio.

 - In the -long- run, individual iommu/dma layers might be modified to
instead use the isolation layer as their own building block if that
makes sense, this can be looked at on a case by case basis but is not a
pre-req.

That leaves us with what I think is the main disagreement between David
and Alex approaches which is whether devices in a group are "detached"
completely (in some kind of limbo) while they are used by vfio or
something else vs. assigned to a different driver (vfio).

After much thinking, I believe that the driver ownership model is the
right one. However, we could still use the work David did here. IE. I
_do_ think that exposing groups as the individual unit of ownership to
user space is the right thing to do.

That means that instead of that current detached flag alone, what we
could do instead is have an in-kernel mechanism to request isolation
which works by having the isolation group kernel code perform:

 - Rename the detached flag to "exclusive". This flag prevents automatic
& user driven attachment of a driver by the generic code. Set that flag
on all devices in the group (need a hook for hotplug).

 - Detach existing drivers.

 - Once everything is detached, attach the requested client driver (vfio
being the one we initially provide, maybe uio could use that framework
as well). This can be done using an explicit in-kernel attach API that
can be used even when the exclusive flag is set.

 - Win

 - When releasing the group, we detach the client driver, clear the
exclusive flag and trigger a re-probe (automatic matching of kernel
drivers).

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07 23:41       ` Benjamin Herrenschmidt
@ 2011-12-08  0:16         ` Chris Wright
  2011-12-08  1:32           ` Benjamin Herrenschmidt
  2011-12-08  5:51         ` Alex Williamson
  1 sibling, 1 reply; 16+ messages in thread
From: Chris Wright @ 2011-12-08  0:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alex Williamson, David Gibson, joerg.roedel, dwmw2, iommu, aik,
	linux-kernel, chrisw, agraf, scottwood, B08248

* Benjamin Herrenschmidt (benh@kernel.crashing.org) wrote:
> After much thinking, I believe that the driver ownership model is the
> right one. However, we could still use the work David did here. IE. I
> _do_ think that exposing groups as the individual unit of ownership to
> user space is the right thing to do.

OK, I think I agree at least w/ the ownership model...

> That means that instead of that current detached flag alone, what we
> could do instead is have an in-kernel mechanism to request isolation
> which works by having the isolation group kernel code perform:
> 
>  - Rename the detached flag to "exclusive". This flag prevents automatic
> & user driven attachment of a driver by the generic code. Set that flag
> on all devices in the group (need a hook for hotplug).
> 
>  - Detach existing drivers.
> 
>  - Once everything is detached, attach the requested client driver (vfio
> being the one we initially provide, maybe uio could use that framework
> as well). This can be done using an explicit in-kernel attach API that
> can be used even when the exclusive flag is set.

Since users will want to drive devices, not groups, how would you
enumerate and expose devices here if the user visible construct exposed
is groups?

thanks,
-chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08  0:16         ` Chris Wright
@ 2011-12-08  1:32           ` Benjamin Herrenschmidt
  2011-12-08  1:36             ` Chris Wright
  0 siblings, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2011-12-08  1:32 UTC (permalink / raw)
  To: Chris Wright
  Cc: Alex Williamson, David Gibson, joerg.roedel, dwmw2, iommu, aik,
	linux-kernel, agraf, scottwood, B08248

On Wed, 2011-12-07 at 16:16 -0800, Chris Wright wrote:
> Since users will want to drive devices, not groups, how would you
> enumerate and expose devices here if the user visible construct
> exposed is groups?

We do expose the individual devices in the group, which can then be used
for things like mmap'ing of BARs etc...

However, at least, the ownership transition is done on a whole group
basis

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08  1:32           ` Benjamin Herrenschmidt
@ 2011-12-08  1:36             ` Chris Wright
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Wright @ 2011-12-08  1:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Chris Wright, Alex Williamson, David Gibson, joerg.roedel, dwmw2,
	iommu, aik, linux-kernel, agraf, scottwood, B08248

* Benjamin Herrenschmidt (benh@kernel.crashing.org) wrote:
> On Wed, 2011-12-07 at 16:16 -0800, Chris Wright wrote:
> > Since users will want to drive devices, not groups, how would you
> > enumerate and expose devices here if the user visible construct
> > exposed is groups?
> 
> We do expose the individual devices in the group, which can then be used
> for things like mmap'ing of BARs etc...

Right, one of the areas that got sticky before, which is why I wondered
how you'd manage it.

> However, at least, the ownership transition is done on a whole group
> basis

Yeah, that makes sense.

thanks,
-chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07 19:45     ` Alex Williamson
  2011-12-07 23:41       ` Benjamin Herrenschmidt
@ 2011-12-08  2:43       ` David Gibson
  2011-12-08  6:23         ` Alex Williamson
  1 sibling, 1 reply; 16+ messages in thread
From: David Gibson @ 2011-12-08  2:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote:
> On Thu, 2011-12-08 at 01:22 +1100, David Gibson wrote:
> > On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote:
> > > On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
> > > > 
> > > > Alex Williamson recently posted some patches adding a new hook to
> > > > iommu_ops to identify groups of devices which cannot reliably be
> > > > distinguished by the iommu, and therefore can only safely be assigned
> > > > as a unit to a guest or userspace driver.
> > > > 
> > > > I'm not very fond of that design, because:
> > > >     - It requires something else to scan the group ids to present the
> > > > groups in a userspace discoverable manner, rather than handling that
> > > > in the core.
> > > 
> > > The thought was that the vast majority of users won't make use of
> > > groups, so why bloat the core when the only user is vfio.  
> > 
> > Well, "won't make use of" actually means "are on a system with
> > all-singleton groups".  And the answer is because it actually doesn't
> > bloat, but simplifies the core.  We need the group concept _at
> > minimum_ to handle bridges and buggy multifunction devices.  So it
> > makes life easier to assume they're always there, even if they're
> > often singletons.
> 
> Ok, let's back up from the patch because I think its clouding the driver
> model with a detached device state that might be detracting from the
> useful bits.

Ok.  I'll start out by saying that I think there's some confusion in
the below because you're tending to identify an isolation group with
an iommu domain, which is not my intention.  An isolation group is the
minimum granularity of an iommu domain, of course, but it's absolutely
my intention that you could build an iommu domain with mutiple groups.
A group is a unit that *can* be isolated, not a unit which *must* be
isolated.

> An isolation group is a generic association of devices into a set based
> on, well... isolation.  Presumably something like an iommu driver does a
> bus_register_notifier() and bus_for_each_dev() and when a new device is
> added we do isolation_group_new() and isolation_group_dev_add() as
> necessary.

Uh, roughly speaking, yes.

>  When the iommu driver needs to do a dma_ops call for a
> device, it can traverse up to the isolation group and find something
> that points it to the shared iommu context for that group.  While
> dma_ops can share the iommu context among the drivers attached to the
> various devices, iommu_ops wants a single meta-driver for the group
> because it's predominantly designed for use when devices are exposed
> directly to the user.  We trust native kernel drivers to share context
> between each other, we do not trust user level drivers to share context
> with anyone but themselves.  Ok so far?

Not so much.  As Ben says, getting the existing kernel dma_ops and so
forth to use the isolation structure is very much a "maybe one day"
thing.

> The first question we run into is what does the iommu driver do if it
> does not provide isolation?  If it's a translation-only iommu, maybe it
> has a single context and can manage to maintain it internally and should
> opt-out of isolation groups.  But we also have things like MSI isolation
> where we have multiple contexts and the isolation strength is sufficient
> for dma_ops, but not iommu_ops.  I don't know if we want to make that
> determination via a strength bitmask at the group level or simply a bool
> that the iommu driver sets when it registers a group.

If the device can't be safely isolated, then it should have no
isolation group.  The definition of "safely" might vary depending on
platform and/or kernel parameters.

> Once we store iommu context at the group, it makes sense to start using
> it as the primary object the iommu drivers operate on.  There's no need
> to break everyone for dma_ops, so isolation aware iommu drivers can
> simply follow the link from the device to the group with no change to
> callers.  iommu_ops though can switch to being entirely group based [not
> looking forward to retrofitting groups into kvm device assignment].

Again, maybe one day.  We don't need to attack that in order to handle
the vfio problem.

> So the next problem is that while the group is the minimum granularity
> for the iommu, it's not necessarily the desired granularity.  iommus
> like VT-d have per PCI BDF context entries that can point to shared page
> tables.  On such systems we also typically have singleton isolation
> groups, so when multiple devices are used by a single user, we have a
> lot of duplication in time and space.  VFIO handles this by allowing
> groups to be "merged".  When this happens, the merged groups point to
> the same iommu context.  I'm not sure what the plan is with isolation
> groups, but we need some way to reduce that overhead.

Right.  So, again, I intend that mutiple groups can go into one
domain.  Not entirely sure of the interface yet.  One I had in mind
was to borrow the vfio1 interface, so you open a /dev/vfio (each open
gives a new instance).  Then you do an "addgroup" ioctl which adds a
group to the domain.  You can do that multiple times, then start using
the domain.

> Finally, we get to how do we manage the devices of the group when we
> switch to iommu_ops mode vs dma_ops, which is what this patch seems to
> have started with.  For VFIO we have bus specific drivers which attach
> to individual devices.  When all the devices in the group are attached
> to the bus driver (by the user), iommu_ops, and therefore access to
> devices from userspace becomes available.
> 
> This is the level that the vast majority of users are not going to make
> use of, so while the above does feel cleaner and justifies it's
> addition, we need to be careful here.
> 
> My concern with the "detached state" approach is that we're removing
> devices from the existing driver model.  Who does things like pm_runtime
> callbacks when we start using devices?

That's a point.  Ben suggests an approach that I'll comment on there.

>  How does a driver enumerate and
> attach to devices within the group?

Well, detaching prevents driver binding.  All the struct devices are
still there and can be examined as usual.  How the group's owner
manages access to individual devices within the group is up to it.

>  How do we even enumerate and attach
> to isolation groups for the group level metadriver?  This is where I'm
> nervous we're just making a requirement for a group-driver model that
> lives alongside the existing device driver model.

Well, sort of.  But the "group driver model" is vastly simpler than
the full driver model, because the assumption is that binding a group
owner to a group is *always* the result of an explicit user request
specifying both owner and group.  There's no matching, no drivers of
different bus types or the other things that make the normal driver
model complicated.

> I actually like the approach that individual devices are bound by the
> user to vfio bus drivers (group/iommu_ops drivers).  This is tedious,
> but easily automate-able in userspace and puts the responsibility of
> unbinding each device on the user.  Imagine the chaos a user could cause
> if vfio creates a chardev for accessing a group, we give a user access
> to it, and each time it's opened all the group devices automatically get
> unbound from the host drivers and re-bound on close.  'while (true); do
> cat /dev/vfio/$GROUP; done' could cause a denial of service on the
> system.

Well, no, because regardless of the interface, detaching the devices
would need equivalent permission.

> We just need to figure out how binding vfio bus drivers to all the
> isolation group devices turns the robot lions into Voltron ;)
> 
> > > >     - It does not provide an obvious means to prevent a kernel
> > > > driver unsafely binding to a new device which is hotplugged into an
> > > > iommu group already assigned to a guest.
> > > 
> > > I've added this for PCI, triggered via the device add notifier.
> > 
> > Hrm.  Does the notifier run early enough to guarantee that a driver
> > can't bind before it's complete?
> 
> Yes, like I assume you're planning to do with setting up groups, it runs
> from the device add bus notifier.  Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-07 23:41       ` Benjamin Herrenschmidt
  2011-12-08  0:16         ` Chris Wright
@ 2011-12-08  5:51         ` Alex Williamson
  2011-12-09  2:38           ` David Gibson
  1 sibling, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2011-12-08  5:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel,
	chrisw, agraf, scottwood, B08248

On Thu, 2011-12-08 at 10:41 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote:
> 
> > Ok, let's back up from the patch because I think its clouding the driver
> > model with a detached device state that might be detracting from the
> > useful bits.
> > 
> > An isolation group is a generic association of devices into a set based
> > on, well... isolation.  Presumably something like an iommu driver does a
> > bus_register_notifier() and bus_for_each_dev() and when a new device is
> > added we do isolation_group_new() and isolation_group_dev_add() as
> > necessary.  When the iommu driver needs to do a dma_ops call for a
> > device, it can traverse up to the isolation group and find something
> > that points it to the shared iommu context for that group.  While
> > dma_ops can share the iommu context among the drivers attached to the
> > various devices, iommu_ops wants a single meta-driver for the group
> > because it's predominantly designed for use when devices are exposed
> > directly to the user.  We trust native kernel drivers to share context
> > between each other, we do not trust user level drivers to share context
> > with anyone but themselves.  Ok so far?
> > 
> > The first question we run into is what does the iommu driver do if it
> > does not provide isolation?  If it's a translation-only iommu, maybe it
> > has a single context and can manage to maintain it internally and should
> > opt-out of isolation groups.  But we also have things like MSI isolation
> > where we have multiple contexts and the isolation strength is sufficient
> > for dma_ops, but not iommu_ops.  I don't know if we want to make that
> > determination via a strength bitmask at the group level or simply a bool
> > that the iommu driver sets when it registers a group.
> 
>  .../...
> 
> I think we are trying to solve too many problems at once.
> 
> I don't believe we should have the normal dma ops path in the kernel be
> rewritten on top of some new iommu isolation layer or anything like that
> at this stage.

I headed down this path because David seems to be suggesting, well
stating, that all this infrastructure is useful for providing simplicity
and efficiency of groups to the core, and not just for this niche vfio
usage.  "You don't get a choice about whether you use groups."  The only
way I can rationalize that is to give it some purpose which is used by
something other that just vfio.  If all of that is "some day", then why
don't we start with the group management in vfio as I have it today and
eventually push that down into the core?

> Proceed step by step or we'll never be finished. What that means imho
> is:
> 
>  - Existing dma/iommu arch layers mostly remain as-is. It has its own
> internal representation of devices, domains and possibly grouping
> information, and things like the buggy RICOH devices are handled there
> (via a flag set in a generic header quirk)

But we're already being harassed to rework iommu_ops to operate based on
groups and to take grouping into account for things like the ricoh
device in dma_ops.  Mixed messages...

>  - Isolation groups are a new high-level thing that is not used by the
> iommu/dma code for normal in-kernel operations at this stage. The
> platform iommu/dma code creates the isolation group objects and
> populates them, it doesn't initially -use- them.

The iommu_device_group() interface is a tiny extension to the IOMMU API
that basically lets vfio query the iommu driver for a group identifier
for a device.  vfio then uses that information to create and manage
groups itself.

This is simply reversing that, allowing the iommu driver to create
groups and add devices, which are centrally managed.  That sounds like a
good long term goal, especially when we start making use of the groups
for dma/iommu ops, but it does require that we define an API both for
group providers (iommu drivers) and the group users (vfio).  It's also
much more intrusive to the iommu drivers (note the trivial patches to
add iommu_device_group support to intel and amd iommu) and creates bloat
for a niche user if we're not going to get around to making groups more
pervasive anytime soon.

>  - We can then use that for the needs of exposing the groups to user
> space, perform driver level isolation, and provide the grouping
> information to vfio.
>
>  - In the -long- run, individual iommu/dma layers might be modified to
> instead use the isolation layer as their own building block if that
> makes sense, this can be looked at on a case by case basis but is not a
> pre-req.
> 
> That leaves us with what I think is the main disagreement between David
> and Alex approaches which is whether devices in a group are "detached"
> completely (in some kind of limbo) while they are used by vfio or
> something else vs. assigned to a different driver (vfio).
> 
> After much thinking, I believe that the driver ownership model is the
> right one. However, we could still use the work David did here. IE. I
> _do_ think that exposing groups as the individual unit of ownership to
> user space is the right thing to do.

So far we're only exposing the group structure here (which vfio also
does in sysfs), not providing a userspace or driver interface for taking
ownership of the group.  I'm still not clear how this ends up satisfying
David's complaints about the group ownership vs device access model in
vfio.

> That means that instead of that current detached flag alone, what we
> could do instead is have an in-kernel mechanism to request isolation
> which works by having the isolation group kernel code perform:
> 
>  - Rename the detached flag to "exclusive". This flag prevents automatic
> & user driven attachment of a driver by the generic code. Set that flag
> on all devices in the group (need a hook for hotplug).

This is a sysfs attribute on the group?  So step #1:

        $ echo 1 > /sys/isolation/$GROUP/exclusive

>  - Detach existing drivers.

step #2:

        $ for i in $(ls /sys/isolation/$GROUP/devices/); do	\
        	echo $(basename $i) > $i/driver/unbind; done
>         
>  - Once everything is detached, attach the requested client driver (vfio
> being the one we initially provide, maybe uio could use that framework
> as well). This can be done using an explicit in-kernel attach API that
> can be used even when the exclusive flag is set.

step #3: A miracle occurs ;)

I'm not sure what this looks like.  Are we going to do a
isolation_group_register_meta_driver() that vfio calls with some defined
ops structure, so a user can:

        $ echo "vfio" > /sys/isolation/$GROUP/driver

That triggers some list walk of the group devices, asking vfio to find
and force a driver attach to the bus driver for each device?  A model
something like that is the only way I can see that removing groups from
vfio doesn't also disassociate the group meta driver from the group
device drivers.

If instead, we make vfio be the group manager (since nobody else uses it
anyway), all of the above becomes:

        $ for i in $(ls /sys/devices/virtual/vfio/pci:257/devices); do	\
        	echo $(basename $i) > $i/driver/unbind;			\
        	echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind;	\
        	done

>  - Win

I wish, we still have to make it through how vfio then exposes the group
to userspace and how we don't end up with the same individual device
manipulation vs group access model that seems to be getting under
David's skin.  Effectively what we've done above is:

      * provide a group model that we'll "some day" make more pervasive
      * make the binding of devices to group device drivers implicit in
        some kind of meta driver attachment

Thanks,

Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08  2:43       ` David Gibson
@ 2011-12-08  6:23         ` Alex Williamson
  2011-12-08  6:52           ` David Gibson
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2011-12-08  6:23 UTC (permalink / raw)
  To: David Gibson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote:
> On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote:
> > So the next problem is that while the group is the minimum granularity
> > for the iommu, it's not necessarily the desired granularity.  iommus
> > like VT-d have per PCI BDF context entries that can point to shared page
> > tables.  On such systems we also typically have singleton isolation
> > groups, so when multiple devices are used by a single user, we have a
> > lot of duplication in time and space.  VFIO handles this by allowing
> > groups to be "merged".  When this happens, the merged groups point to
> > the same iommu context.  I'm not sure what the plan is with isolation
> > groups, but we need some way to reduce that overhead.
> 
> Right.  So, again, I intend that mutiple groups can go into one
> domain.  Not entirely sure of the interface yet.  One I had in mind
> was to borrow the vfio1 interface, so you open a /dev/vfio (each open
> gives a new instance).  Then you do an "addgroup" ioctl which adds a
> group to the domain.  You can do that multiple times, then start using
> the domain.

This also revisits one of the primary problems of vfio1, the dependency
on a privileged uiommu domain creation interface.  Assigning a user
ownership of a group should be a privileged operation.  If a privileged
user needs to open /dev/vfio, add groups, then drop privileges and hand
the open file descriptor to an unprivileged user, the interface becomes
much harder to use.  "Hot merging" becomes impossible.

Alex


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08  6:23         ` Alex Williamson
@ 2011-12-08  6:52           ` David Gibson
  2011-12-08 14:28             ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2011-12-08  6:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Wed, Dec 07, 2011 at 11:23:10PM -0700, Alex Williamson wrote:
> On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote:
> > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote:
> > > So the next problem is that while the group is the minimum granularity
> > > for the iommu, it's not necessarily the desired granularity.  iommus
> > > like VT-d have per PCI BDF context entries that can point to shared page
> > > tables.  On such systems we also typically have singleton isolation
> > > groups, so when multiple devices are used by a single user, we have a
> > > lot of duplication in time and space.  VFIO handles this by allowing
> > > groups to be "merged".  When this happens, the merged groups point to
> > > the same iommu context.  I'm not sure what the plan is with isolation
> > > groups, but we need some way to reduce that overhead.
> > 
> > Right.  So, again, I intend that mutiple groups can go into one
> > domain.  Not entirely sure of the interface yet.  One I had in mind
> > was to borrow the vfio1 interface, so you open a /dev/vfio (each open
> > gives a new instance).  Then you do an "addgroup" ioctl which adds a
> > group to the domain.  You can do that multiple times, then start using
> > the domain.
> 
> This also revisits one of the primary problems of vfio1, the dependency
> on a privileged uiommu domain creation interface.  Assigning a user
> ownership of a group should be a privileged operation.  If a privileged
> user needs to open /dev/vfio, add groups, then drop privileges and hand
> the open file descriptor to an unprivileged user, the interface becomes
> much harder to use.  "Hot merging" becomes impossible.

No, I was assuming that "permission to detach" could be handed out to
a user before this step.  uid/gid/mode attributes in sysfs would
suffice, though there might be better ways.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08  6:52           ` David Gibson
@ 2011-12-08 14:28             ` Alex Williamson
  2011-12-09  3:00               ` David Gibson
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2011-12-08 14:28 UTC (permalink / raw)
  To: David Gibson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Thu, 2011-12-08 at 17:52 +1100, David Gibson wrote:
> On Wed, Dec 07, 2011 at 11:23:10PM -0700, Alex Williamson wrote:
> > On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote:
> > > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote:
> > > > So the next problem is that while the group is the minimum granularity
> > > > for the iommu, it's not necessarily the desired granularity.  iommus
> > > > like VT-d have per PCI BDF context entries that can point to shared page
> > > > tables.  On such systems we also typically have singleton isolation
> > > > groups, so when multiple devices are used by a single user, we have a
> > > > lot of duplication in time and space.  VFIO handles this by allowing
> > > > groups to be "merged".  When this happens, the merged groups point to
> > > > the same iommu context.  I'm not sure what the plan is with isolation
> > > > groups, but we need some way to reduce that overhead.
> > > 
> > > Right.  So, again, I intend that mutiple groups can go into one
> > > domain.  Not entirely sure of the interface yet.  One I had in mind
> > > was to borrow the vfio1 interface, so you open a /dev/vfio (each open
> > > gives a new instance).  Then you do an "addgroup" ioctl which adds a
> > > group to the domain.  You can do that multiple times, then start using
> > > the domain.
> > 
> > This also revisits one of the primary problems of vfio1, the dependency
> > on a privileged uiommu domain creation interface.  Assigning a user
> > ownership of a group should be a privileged operation.  If a privileged
> > user needs to open /dev/vfio, add groups, then drop privileges and hand
> > the open file descriptor to an unprivileged user, the interface becomes
> > much harder to use.  "Hot merging" becomes impossible.
> 
> No, I was assuming that "permission to detach" could be handed out to
> a user before this step.  uid/gid/mode attributes in sysfs would
> suffice, though there might be better ways.

Wow.  So you effectively get:

# chown user:user /sys/isolation/$GROUP
$ fd = open(/dev/vfio)
$ ioctl(fd, addgroup, $GROUP)
  ^^^^ this is where the actual detach occurs?

And vfio has to find the $GROUP object, figure out where /sys is mounted
on this system, construct a path to $SYS/isolation/$GROUP, and check the
file access before calling the isolation group detach API.

Then later...

$ ioctl(fd, delgroup, $GROUP)

But a reattach can't occur here or we've just given the user the power
to cause the DoS, or at least system chaos, I described earlier.  So
then the admin later has to:

# chown root:root /sys/isolation/$GROUP
# echo 0 > /sys/ioslation/$GROUP/detached

Under the covers, "addgroup" would require vfio to a) "detach" the
group, b) put the devices in an iommu domain.  If we're dealing with
multiple group, b) can be rejected by the iommu driver, at which point
vfio then sends a "reattach" to the isolation API (but that can't
actually do much, as noted above), the user then opens a new instance
of /dev/vfio and tries an "addgroup" there.

Also note that there was a request to make it possible to get multiple
file descriptors to the same group/device/iommu from different threads.
I don't see how that's possible here.

And this is somehow better than the perceived dis-symmetry that annoys
you in vfio?  I'm getting a strong not-invented-here vibe and I'm not
sure you've internalized all the interactions that make vfio manage
groups the way it does.

Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08  5:51         ` Alex Williamson
@ 2011-12-09  2:38           ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2011-12-09  2:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, joerg.roedel, dwmw2, iommu, aik,
	linux-kernel, chrisw, agraf, scottwood, B08248

On Wed, Dec 07, 2011 at 10:51:59PM -0700, Alex Williamson wrote:
> On Thu, 2011-12-08 at 10:41 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote:
> > 
> > > Ok, let's back up from the patch because I think its clouding the driver
> > > model with a detached device state that might be detracting from the
> > > useful bits.
> > > 
> > > An isolation group is a generic association of devices into a set based
> > > on, well... isolation.  Presumably something like an iommu driver does a
> > > bus_register_notifier() and bus_for_each_dev() and when a new device is
> > > added we do isolation_group_new() and isolation_group_dev_add() as
> > > necessary.  When the iommu driver needs to do a dma_ops call for a
> > > device, it can traverse up to the isolation group and find something
> > > that points it to the shared iommu context for that group.  While
> > > dma_ops can share the iommu context among the drivers attached to the
> > > various devices, iommu_ops wants a single meta-driver for the group
> > > because it's predominantly designed for use when devices are exposed
> > > directly to the user.  We trust native kernel drivers to share context
> > > between each other, we do not trust user level drivers to share context
> > > with anyone but themselves.  Ok so far?
> > > 
> > > The first question we run into is what does the iommu driver do if it
> > > does not provide isolation?  If it's a translation-only iommu, maybe it
> > > has a single context and can manage to maintain it internally and should
> > > opt-out of isolation groups.  But we also have things like MSI isolation
> > > where we have multiple contexts and the isolation strength is sufficient
> > > for dma_ops, but not iommu_ops.  I don't know if we want to make that
> > > determination via a strength bitmask at the group level or simply a bool
> > > that the iommu driver sets when it registers a group.
> > 
> >  .../...
> > 
> > I think we are trying to solve too many problems at once.
> > 
> > I don't believe we should have the normal dma ops path in the kernel be
> > rewritten on top of some new iommu isolation layer or anything like that
> > at this stage.
> 
> I headed down this path because David seems to be suggesting, well
> stating, that all this infrastructure is useful for providing simplicity
> and efficiency of groups to the core, and not just for this niche vfio
> usage.  "You don't get a choice about whether you use groups."  The only
> way I can rationalize that is to give it some purpose which is used by
> something other that just vfio.  If all of that is "some day", then why
> don't we start with the group management in vfio as I have it today and
> eventually push that down into the core?

So.. you misunderstand what I meant by that, but I just realised
that's because I misunderstood what I said that in response to.

My point was that if you're using vfio (or something like it), then
the choice of whether groups are relevant is not the user's, but
depends on the hardware setup.

But I now realise that your point was that very few people will use
vfio itself, or anything like it.  And therefore you're uncomfortable
with the extent to which my patch impinges on the core driver model.

Which is a good point.  I still think this sufficiently solidifies the
isolation model to be worth it.

> > Proceed step by step or we'll never be finished. What that means imho
> > is:
> > 
> >  - Existing dma/iommu arch layers mostly remain as-is. It has its own
> > internal representation of devices, domains and possibly grouping
> > information, and things like the buggy RICOH devices are handled there
> > (via a flag set in a generic header quirk)
> 
> But we're already being harassed to rework iommu_ops to operate based on
> groups and to take grouping into account for things like the ricoh
> device in dma_ops.  Mixed messages...

Well, ok, maybe not so far in the future, but in any case this can be
considered independently.

> >  - Isolation groups are a new high-level thing that is not used by the
> > iommu/dma code for normal in-kernel operations at this stage. The
> > platform iommu/dma code creates the isolation group objects and
> > populates them, it doesn't initially -use- them.
> 
> The iommu_device_group() interface is a tiny extension to the IOMMU API
> that basically lets vfio query the iommu driver for a group identifier
> for a device.  vfio then uses that information to create and manage
> groups itself.
> 
> This is simply reversing that, allowing the iommu driver to create
> groups and add devices, which are centrally managed.  That sounds like a
> good long term goal, especially when we start making use of the groups
> for dma/iommu ops, but it does require that we define an API both for
> group providers (iommu drivers) and the group users (vfio).  It's also
> much more intrusive to the iommu drivers (note the trivial patches to
> add iommu_device_group support to intel and amd iommu) and creates bloat
> for a niche user if we're not going to get around to making groups more
> pervasive anytime soon.
> 
> >  - We can then use that for the needs of exposing the groups to user
> > space, perform driver level isolation, and provide the grouping
> > information to vfio.
> >
> >  - In the -long- run, individual iommu/dma layers might be modified to
> > instead use the isolation layer as their own building block if that
> > makes sense, this can be looked at on a case by case basis but is not a
> > pre-req.
> > 
> > That leaves us with what I think is the main disagreement between David
> > and Alex approaches which is whether devices in a group are "detached"
> > completely (in some kind of limbo) while they are used by vfio or
> > something else vs. assigned to a different driver (vfio).
> > 
> > After much thinking, I believe that the driver ownership model is the
> > right one. However, we could still use the work David did here. IE. I
> > _do_ think that exposing groups as the individual unit of ownership to
> > user space is the right thing to do.
> 
> So far we're only exposing the group structure here (which vfio also
> does in sysfs), not providing a userspace or driver interface for taking
> ownership of the group.  I'm still not clear how this ends up satisfying
> David's complaints about the group ownership vs device access model in
> vfio.
> 
> > That means that instead of that current detached flag alone, what we
> > could do instead is have an in-kernel mechanism to request isolation
> > which works by having the isolation group kernel code perform:
> > 
> >  - Rename the detached flag to "exclusive". This flag prevents automatic
> > & user driven attachment of a driver by the generic code. Set that flag
> > on all devices in the group (need a hook for hotplug).
> 
> This is a sysfs attribute on the group?  So step #1:
> 
>         $ echo 1 > /sys/isolation/$GROUP/exclusive
> 
> >  - Detach existing drivers.
> 
> step #2:
> 
>         $ for i in $(ls /sys/isolation/$GROUP/devices/); do	\
>         	echo $(basename $i) > $i/driver/unbind; done

Right, but this version can be messed up if there's a hotplug after
this but before vfio binds to the group.

> >  - Once everything is detached, attach the requested client driver (vfio
> > being the one we initially provide, maybe uio could use that framework
> > as well). This can be done using an explicit in-kernel attach API that
> > can be used even when the exclusive flag is set.
> 
> step #3: A miracle occurs ;)
> 
> I'm not sure what this looks like.  Are we going to do a
> isolation_group_register_meta_driver() that vfio calls with some defined
> ops structure, so a user can:
> 
>         $ echo "vfio" > /sys/isolation/$GROUP/driver
> 
> That triggers some list walk of the group devices, asking vfio to find
> and force a driver attach to the bus driver for each device?  A model
> something like that is the only way I can see that removing groups from
> vfio doesn't also disassociate the group meta driver from the group
> device drivers.

That's one way we could do it.  Another option is that the user
explicitly just detaches the group, which puts it in limbo.  The
asking to add the group from the vfio interface triggers attaching the
vfio "meta-driver" to the group (via a call to something like
isolation_group_claim(group, &vfio_isolation_user)).  I'm not sure
what's best, I'll try to work something up and see how it looks in the
next spin of my code.

> If instead, we make vfio be the group manager (since nobody else uses it
> anyway), all of the above becomes:
> 
>         $ for i in $(ls /sys/devices/virtual/vfio/pci:257/devices); do	\
>         	echo $(basename $i) > $i/driver/unbind;			\
>         	echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind;	\
>         	done
>         
> >  - Win
> 
> I wish, we still have to make it through how vfio then exposes the group
> to userspace and how we don't end up with the same individual device
> manipulation vs group access model that seems to be getting under
> David's skin.

I don't have a problem with individual device manipulation for things
that pertain to an individual device - getting the resources, doing
mmio, attaching irqs.  I just want thing pertaining to a group's
ownership change to be explicitly group oriented.

>  Effectively what we've done above is:
> 
>       * provide a group model that we'll "some day" make more pervasive
>       * make the binding of devices to group device drivers implicit in
>         some kind of meta driver attachment

That's not a bad summary.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Device isolation infrastructure
  2011-12-08 14:28             ` Alex Williamson
@ 2011-12-09  3:00               ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2011-12-09  3:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf,
	scottwood, B08248, benh

On Thu, Dec 08, 2011 at 07:28:25AM -0700, Alex Williamson wrote:
> On Thu, 2011-12-08 at 17:52 +1100, David Gibson wrote:
> > On Wed, Dec 07, 2011 at 11:23:10PM -0700, Alex Williamson wrote:
> > > On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote:
> > > > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote:
> > > > > So the next problem is that while the group is the minimum granularity
> > > > > for the iommu, it's not necessarily the desired granularity.  iommus
> > > > > like VT-d have per PCI BDF context entries that can point to shared page
> > > > > tables.  On such systems we also typically have singleton isolation
> > > > > groups, so when multiple devices are used by a single user, we have a
> > > > > lot of duplication in time and space.  VFIO handles this by allowing
> > > > > groups to be "merged".  When this happens, the merged groups point to
> > > > > the same iommu context.  I'm not sure what the plan is with isolation
> > > > > groups, but we need some way to reduce that overhead.
> > > > 
> > > > Right.  So, again, I intend that mutiple groups can go into one
> > > > domain.  Not entirely sure of the interface yet.  One I had in mind
> > > > was to borrow the vfio1 interface, so you open a /dev/vfio (each open
> > > > gives a new instance).  Then you do an "addgroup" ioctl which adds a
> > > > group to the domain.  You can do that multiple times, then start using
> > > > the domain.
> > > 
> > > This also revisits one of the primary problems of vfio1, the dependency
> > > on a privileged uiommu domain creation interface.  Assigning a user
> > > ownership of a group should be a privileged operation.  If a privileged
> > > user needs to open /dev/vfio, add groups, then drop privileges and hand
> > > the open file descriptor to an unprivileged user, the interface becomes
> > > much harder to use.  "Hot merging" becomes impossible.
> > 
> > No, I was assuming that "permission to detach" could be handed out to
> > a user before this step.  uid/gid/mode attributes in sysfs would
> > suffice, though there might be better ways.
> 
> Wow.  So you effectively get:
> 
> # chown user:user /sys/isolation/$GROUP

I was thinking more /sys/isolation/$GROUP/{uid,gid,mode} properties,
because I'm not sure you can implement chown simply in sysfs, but,
whatever.

> $ fd = open(/dev/vfio)
> $ ioctl(fd, addgroup, $GROUP)
>   ^^^^ this is where the actual detach occurs?
> 
> And vfio has to find the $GROUP object,

That's easy.

> figure out where /sys is mounted
> on this system,

a) /sys is always at /sys (see Documentation/sysfs-tules.txt

b) we don't need that anyway.  vfio can use use an in-kernel interface
to grab a handle on the group, it doesn't need to trawl through the
filesystem.


> construct a path to $SYS/isolation/$GROUP, and check the
> file access before calling the isolation group detach API.

Well, no, because the perms would just be in struct isolation_group,
and we'd have a helper function that detaches iff current has the
relevant permission.

> Then later...
> 
> $ ioctl(fd, delgroup, $GROUP)
> 
> But a reattach can't occur here or we've just given the user the power
> to cause the DoS, or at least system chaos, I described earlier.  So
> then the admin later has to:

I think you overestimate this "chaos".  But this is why I'm
considering an alternate approach where the group is detached, then
"meta-bound" as separate steps.

> 
> # chown root:root /sys/isolation/$GROUP
> # echo 0 > /sys/ioslation/$GROUP/detached

Well, if you want to give it back to normal drivers.  More usually, I
expect you'd detach the group at boot, and when the first guest using
it dies, it sits in limbo until the next guest (or more likely the
same guest, restarted) claims it.

> Under the covers, "addgroup" would require vfio to a) "detach" the
> group,

One function call.

> b) put the devices in an iommu domain.  If we're dealing with
> multiple group, 

Well, it has to do that anyway.

> b) can be rejected by the iommu driver, at which point
> vfio then sends a "reattach" to the isolation API (but that can't
> actually do much, as noted above),

Yes.  Again, one function call.

> the user then opens a new instance
> of /dev/vfio and tries an "addgroup" there.
> 
> Also note that there was a request to make it possible to get multiple
> file descriptors to the same group/device/iommu from different threads.
> I don't see how that's possible here.

dup(2)?  Although it depends exactly what the alternate handles are
needed for.  I can't actually see why you would need them at all, at
present.

> And this is somehow better than the perceived dis-symmetry that annoys
> you in vfio?  I'm getting a strong not-invented-here vibe and I'm not
> sure you've internalized all the interactions that make vfio manage
> groups the way it does.
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-12-09  3:02 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-07  3:58 RFC: Device isolation infrastructure David Gibson
2011-12-07  6:22 ` Benjamin Herrenschmidt
2011-12-07  6:46 ` Alex Williamson
2011-12-07 14:22   ` David Gibson
2011-12-07 19:45     ` Alex Williamson
2011-12-07 23:41       ` Benjamin Herrenschmidt
2011-12-08  0:16         ` Chris Wright
2011-12-08  1:32           ` Benjamin Herrenschmidt
2011-12-08  1:36             ` Chris Wright
2011-12-08  5:51         ` Alex Williamson
2011-12-09  2:38           ` David Gibson
2011-12-08  2:43       ` David Gibson
2011-12-08  6:23         ` Alex Williamson
2011-12-08  6:52           ` David Gibson
2011-12-08 14:28             ` Alex Williamson
2011-12-09  3:00               ` David Gibson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.