* RFC: Device isolation infrastructure
@ 2011-12-07 3:58 David Gibson
2011-12-07 6:22 ` Benjamin Herrenschmidt
2011-12-07 6:46 ` Alex Williamson
0 siblings, 2 replies; 16+ messages in thread
From: David Gibson @ 2011-12-07 3:58 UTC (permalink / raw)
To: Alex Williamson, joerg.roedel, dwmw2, iommu, aik
Cc: linux-kernel, chrisw, agraf, scottwood, B08248, benh
Alex Williamson recently posted some patches adding a new hook to
iommu_ops to identify groups of devices which cannot reliably be
distinguished by the iommu, and therefore can only safely be assigned
as a unit to a guest or userspace driver.
I'm not very fond of that design, because:
- It requires something else to scan the group ids to present the
groups in a userspace discoverable manner, rather than handling that
in the core.
- It encorages manipulating devices individually, but only
permitting certain actions if the whole group is together. This seems
confusing and arse-backwards to me.
- It does not provide an obvious means to prevent a kernel
driver unsafely binding to a new device which is hotplugged into an
iommu group already assigned to a guest.
- It is inherently centred arounf the iommu structure, however
there might be other platform dependent factors which also affect
which devices can be safely isolated from others.
These could probably each be addressed with various tweaks, but
basically, I'm just not convinced that it represents the right
structure for handling this. So, Alexey Kardashevskiy and myself havr
made this patches, representing the outline of how I think this should
be handled.
Notes:
* All names are negotiable. At the moment I'm thinking of this as
the "device isolation subsystem" which handles "device isolation
groups" (not to be confused with an iommu domain, which could contain
one or more groups). A group is considered "detached" when it has
been removed from the normal driver binding process, and therefore
ready for assignment to a guest or userspace. I'm more than open to
suggestions of better names.
* This is just the generic infrastructure. Obviously, platform
specific code will be needed as well to create the groups and assign
devices to them. We have a proof of concept implementation for the
power p5ioc2 pci host bridge, and I hope to post code for that and the
p7ioc bridge shortly. I'll also be looking at how this would be used
by Intel VTd, and AMD iommus.
* isolation groups appear in /sys/isolation. Isolatable devices
have a link to their group. The group contains links to all their
devices, and a "detached" attribute, which can be used to view and
control whether the group is detached.
* At present groups are detached manually via sysfs, but the idea
is a kernel system like vfio, could also directly request a group to
be detached for its use.
* when detached, all kernel drivers are unbound from every device
in the group. No driver will match any device in the group until it
is reattached.
* There are probably stupid bugs, in particular I know some error
paths are broken. By all means point these out to expedite fixing
them.
* The isolation_group structure is more or less a stub at this
point. It will probably want a pointer to an iommu_ops and/or its own
ops structure. The 'detached' bool may need to become some sort of
state variable. We'll probably want some sort of "owner" field for
what (e.g. vfio) is using the group when its detached.
* I've also considered adding an "isolation strength" bitmask to
indicate how well isolated the group is - e.g DMA isolated (iommu),
irq isolated (irq remapping iommu, or other mechanism), error isolated
(bus errors from this device can't affect other devices), ..
Anyway, on with the code..
From: David Gibson <david@gibson.dropbear.id.au>
To: Alex Williamson <alex.williamson@redhat.com>,
joerg.roedel@amd.com, dwmw2@infradead.org,
iommu@lists.linux-foundation.org,
aik@au1.ibm.com
Cc: linux-kernel@vger.kernel.org, chrisw@redhat.com,
agraf@suse.de, scottwood@freescale.com, B08248@freescale.com,
benh@kernel.crashing.org
Bcc:
Subject: RFC: Device isolation infrastructure
Reply-To:
Alex Williamson recently posted some patches adding a new hook to
iommu_ops to identify groups of devices which cannot reliably be
distinguished by the iommu, and therefore can only safely be assigned
as a unit to a guest or userspace driver.
I'm not very fond of that design, because:
- It requires something else to scan the group ids to present the
groups in a userspace discoverable manner, rather than handling that
in the core.
- It encorages manipulating devices individually, but only
permitting certain actions if the whole group is together. This seems
confusing and arse-backwards to me.
- It does not provide an obvious means to prevent a kernel
driver unsafely binding to a new device which is hotplugged into an
iommu group already assigned to a guest.
- It is inherently centred arounf the iommu structure, however
there might be other platform dependent factors which also affect
which devices can be safely isolated from others.
These could probably each be addressed with various tweaks, but
basically, I'm just not convinced that it represents the right
structure for handling this. So, Alexey Kardashevskiy and myself havr
made this patches, representing the outline of how I think this should
be handled.
Notes:
* All names are negotiable. At the moment I'm thinking of this as
the "device isolation subsystem" which handles "device isolation
groups" (not to be confused with an iommu domain, which could contain
one or more groups). A group is considered "detached" when it has
been removed from the normal driver binding process, and therefore
ready for assignment to a guest or userspace. I'm more than open to
suggestions of better names.
* This is just the generic infrastructure. Obviously, platform
specific code will be needed as well to create the groups and assign
devices to them. We have a proof of concept implementation for the
power p5ioc2 pci host bridge, and I hope to post code for that and the
p7ioc bridge shortly. I'll also be looking at how this would be used
by Intel VTd, and AMD iommus.
* isolation groups appear in /sys/isolation. Isolatable devices
have a link to their group. The group contains links to all their
devices, and a "detached" attribute, which can be used to view and
control whether the group is detached.
* At present groups are detached manually via sysfs, but the idea
is a kernel system like vfio, could also directly request a group to
be detached for its use.
* when detached, all kernel drivers are unbound from every device
in the group. No driver will match any device in the group until it
is reattached.
* There are probably stupid bugs, in particular I know some error
paths are broken. By all means point these out to expedite fixing
them.
* The isolation_group structure is more or less a stub at this
point. It will probably want a pointer to an iommu_ops and/or its own
ops structure. The 'detached' bool may need to become some sort of
state variable. We'll probably want some sort of "owner" field for
what (e.g. vfio) is using the group when its detached.
* I've also considered adding an "isolation strength" bitmask to
indicate how well isolated the group is - e.g DMA isolated (iommu),
irq isolated (irq remapping iommu, or other mechanism), error isolated
(bus errors from this device can't affect other devices), ..
Anyway, on with the code..
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/base/base.h | 5 +
drivers/base/core.c | 7 +
drivers/base/init.c | 2 +
drivers/isolation/Kconfig | 3 +
drivers/isolation/Makefile | 2 +
drivers/isolation/device_isolation.c | 216 ++++++++++++++++++++++++++++++++++
include/linux/device.h | 5 +
include/linux/device_isolation.h | 79 ++++++++++++
10 files changed, 323 insertions(+), 0 deletions(-)
create mode 100644 drivers/isolation/Kconfig
create mode 100644 drivers/isolation/Makefile
create mode 100644 drivers/isolation/device_isolation.c
create mode 100644 include/linux/device_isolation.h
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 95b9e7e..2ce5717 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
source "drivers/virt/Kconfig"
+source "drivers/isolation/Kconfig"
+
endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 7fa433a..741874a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT) += iommu/
# Virtualization drivers
obj-$(CONFIG_VIRT_DRIVERS) += virt/
+
+obj-$(CONFIG_DEVICE_ISOLATION) += isolation/
diff --git a/drivers/base/base.h b/drivers/base/base.h
index a34dca0..66a4ed2 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -24,6 +24,9 @@
* bus_type/class to be statically allocated safely. Nothing outside of the
* driver core should ever touch these fields.
*/
+
+#include <linux/device_isolation.h>
+
struct subsys_private {
struct kset subsys;
struct kset *devices_kset;
@@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev);
static inline int driver_match_device(struct device_driver *drv,
struct device *dev)
{
+ if (isolation_device_is_detached(dev))
+ return 0;
return drv->bus->match ? drv->bus->match(dev, drv) : 1;
}
diff --git a/drivers/base/core.c b/drivers/base/core.c
index bc8729d..45201c5 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -22,6 +22,7 @@
#include <linux/kallsyms.h>
#include <linux/mutex.h>
#include <linux/async.h>
+#include <linux/device_isolation.h>
#include "base.h"
#include "power/power.h"
@@ -593,6 +594,10 @@ void device_initialize(struct device *dev)
lockdep_set_novalidate_class(&dev->mutex);
spin_lock_init(&dev->devres_lock);
INIT_LIST_HEAD(&dev->devres_head);
+#ifdef CONFIG_DEVICE_ISOLATION
+ INIT_LIST_HEAD(&dev->isolation_list);
+ dev->isolation_group = NULL;
+#endif
device_pm_init(dev);
set_dev_node(dev, -1);
}
@@ -993,6 +998,8 @@ int device_add(struct device *dev)
class_intf->add_dev(dev, class_intf);
mutex_unlock(&dev->class->p->class_mutex);
}
+
+ isolation_group_update_links(dev);
done:
put_device(dev);
return error;
diff --git a/drivers/base/init.c b/drivers/base/init.c
index c8a934e..89f3e94 100644
--- a/drivers/base/init.c
+++ b/drivers/base/init.c
@@ -8,6 +8,7 @@
#include <linux/device.h>
#include <linux/init.h>
#include <linux/memory.h>
+#include <linux/device_isolation.h>
#include "base.h"
@@ -24,6 +25,7 @@ void __init driver_init(void)
devices_init();
buses_init();
classes_init();
+ isolation_init();
firmware_init();
hypervisor_init();
diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig
new file mode 100644
index 0000000..a2a7703
--- /dev/null
+++ b/drivers/isolation/Kconfig
@@ -0,0 +1,3 @@
+config DEVICE_ISOLATION
+ bool "Enable isolating devices for safe pass-through to guests or user space."
+
diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile
new file mode 100644
index 0000000..8f1c6ec
--- /dev/null
+++ b/drivers/isolation/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o
+
diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c
new file mode 100644
index 0000000..2b00ff7
--- /dev/null
+++ b/drivers/isolation/device_isolation.c
@@ -0,0 +1,216 @@
+#include <linux/slab.h>
+#include <linux/device_isolation.h>
+
+#define KERN_DEBUG KERN_WARNING
+
+
+static struct kset *isolation_kset;
+
+#define to_isolation_attr(_attr) container_of(_attr, \
+ struct isolation_attribute, attr)
+
+static ssize_t isolation_attr_show(struct kobject *kobj,
+ struct attribute *attr, char *buf)
+{
+ struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
+ struct isolation_group *group = container_of(kobj,
+ struct isolation_group, kobj);
+ ssize_t ret = -EIO;
+
+ if (isolation_attr->show)
+ ret = isolation_attr->show(group, buf);
+ return ret;
+}
+
+static ssize_t isolation_attr_store(struct kobject *kobj,
+ struct attribute *attr, const char *buf, size_t count)
+{
+ struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
+ struct isolation_group *group = container_of(kobj,
+ struct isolation_group, kobj);
+ ssize_t ret = -EIO;
+
+ if (isolation_attr->store)
+ ret = isolation_attr->store(group, buf, count);
+ return ret;
+}
+
+static void isolation_release(struct kobject *kobj)
+{
+ struct isolation_group *group = container_of(kobj,
+ struct isolation_group, kobj);
+
+ pr_debug("isolation : release.\n");
+
+ kfree(group);
+}
+
+static const struct sysfs_ops isolation_sysfs_ops = {
+ .show = isolation_attr_show,
+ .store = isolation_attr_store,
+};
+
+static struct kobj_type isolation_ktype = {
+ .sysfs_ops = &isolation_sysfs_ops,
+ .release = isolation_release,
+};
+
+static ssize_t group_showdetached(struct isolation_group *group,
+ char *buf)
+{
+ return sprintf(buf, "%u", !!group->detached);
+}
+
+static ssize_t group_setdetached(struct isolation_group *group,
+ const char *buf, size_t count)
+{
+ switch (buf[0]) {
+ case '0':
+ isolation_group_reattach(group);
+ break;
+ case '1':
+ isolation_group_detach(group);
+ break;
+ default:
+ count = -EINVAL;
+ }
+ return count;
+}
+
+static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR,
+ group_showdetached, group_setdetached);
+
+struct isolation_group *isolation_group_new(const char *name)
+{
+ int ret;
+ struct isolation_group *group;
+
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return NULL;
+
+ spin_lock_init(&group->lock);
+
+ group->kobj.kset = isolation_kset;
+ ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name);
+ if (ret < 0) {
+ printk(KERN_ERR "device_isolation: kobject_init_and_add failed "
+ "for %s\n", group->kobj.name);
+ return NULL;
+ }
+
+ ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr);
+ if (0 > ret)
+ printk(KERN_WARNING "device_isolation: create \"detached\" failed "
+ "for %s, errno=%i\n", group->kobj.name, ret);
+
+ INIT_LIST_HEAD(&group->devices);
+ printk(KERN_DEBUG "device_isolation: group %s created\n",
+ group->kobj.name);
+
+ return group;
+}
+
+void isolation_group_dev_add(struct isolation_group *group, struct device *dev)
+{
+ printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
+ dev->kobj.name, group->kobj.name);
+ spin_lock(&group->lock);
+
+ list_add_tail(&dev->isolation_list, &group->devices);
+ dev->isolation_group = group;
+ /* Todo: remove links we already created */
+ spin_unlock(&group->lock);
+}
+
+void isolation_group_dev_remove(struct device *dev)
+{
+ list_del(&dev->isolation_list);
+}
+
+int isolation_group_update_links(struct device *dev)
+{
+ int error;
+ char *buf;
+ struct isolation_group *group = dev->isolation_group;
+
+ if (!group) {
+ printk(KERN_DEBUG "device_isolation: no group for %s\n",
+ dev->kobj.name);
+ return 0;
+ }
+
+ printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
+ dev->kobj.name, group->kobj.name);
+
+ spin_lock(&group->lock);
+
+ error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group");
+ if (0 > error)
+ printk(KERN_WARNING "device_isolation: create isolation_group "
+ "link failed for %s -> %s, errno=%i\n",
+ dev->kobj.name, group->kobj.name, error);
+
+ buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev);
+ if (!buf) {
+ error = -ENOMEM;
+ printk(KERN_ERR "device_isolation: kasprintf failed\n");
+ goto out;
+ }
+
+ error = sysfs_create_link(&group->kobj, &dev->kobj, buf);
+ kfree(buf);
+ if (0 > error)
+ printk(KERN_WARNING "device_isolation: create %s "
+ "link failed for %s -> %s, errno=%i\n",
+ buf, dev->kobj.name, group->kobj.name, error);
+
+out:
+ /* Todo: remove links we already created */
+ spin_unlock(&group->lock);
+ return error;
+}
+
+int isolation_group_detach(struct isolation_group *group)
+{
+ int error;
+ struct device *dev;
+
+ spin_lock(&group->lock);
+
+ group->detached = true;
+ list_for_each_entry(dev, &group->devices, isolation_list) {
+ error = device_reprobe(dev);
+ }
+
+ spin_unlock(&group->lock);
+
+ return 0;
+}
+
+int isolation_group_reattach(struct isolation_group *group)
+{
+ int error;
+ struct device *dev;
+
+ spin_lock(&group->lock);
+
+ group->detached = false;
+ list_for_each_entry(dev, &group->devices, isolation_list) {
+ printk("Reprobing %s\n", dev->kobj.name);
+ error = device_reprobe(dev);
+ }
+
+ spin_unlock(&group->lock);
+
+ return 0;
+}
+
+int __init isolation_init(void)
+{
+ isolation_kset = kset_create_and_add("isolation", NULL, NULL);
+ if (!isolation_kset)
+ return -ENOMEM;
+ return 0;
+}
+
diff --git a/include/linux/device.h b/include/linux/device.h
index c20dfbf..6238a13 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -585,6 +585,11 @@ struct device {
struct dma_coherent_mem *dma_mem; /* internal for coherent mem
override */
+#ifdef CONFIG_DEVICE_ISOLATION
+ struct isolation_group *isolation_group;
+ struct list_head isolation_list;
+#endif
+
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h
new file mode 100644
index 0000000..1dc7e68
--- /dev/null
+++ b/include/linux/device_isolation.h
@@ -0,0 +1,79 @@
+#ifndef _DEVICE_ISOLATION_H_
+#define _DEVICE_ISOLATION_H_
+
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/device.h>
+
+struct isolation_group {
+ struct kobject kobj;
+ struct list_head devices;
+ spinlock_t lock;
+ bool detached;
+};
+
+struct isolation_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct isolation_group *group, char *buf);
+ ssize_t (*store)(struct isolation_group *group, const char *buf,
+ size_t count);
+};
+
+#ifdef CONFIG_DEVICE_ISOLATION
+
+int __init isolation_init(void);
+
+struct isolation_group *isolation_group_new(const char *name);
+int isolation_group_delete(struct isolation_group *group);
+
+void isolation_group_dev_add(struct isolation_group *group,
+ struct device *dev);
+void isolation_group_dev_remove(struct device *dev);
+
+int isolation_group_update_links(struct device *dev);
+int isolation_group_detach(struct isolation_group *group);
+int isolation_group_reattach(struct isolation_group *group);
+
+#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store) \
+ struct isolation_attribute isolation_attr_##_name = \
+ __ATTR(_name, _mode, _show, _store)
+
+static inline int isolation_device_is_detached(struct device *dev)
+{
+ return dev->isolation_group && dev->isolation_group->detached;
+}
+
+#else
+
+static inline int __init isolation_init(void)
+{
+ return 0;
+}
+
+static inline struct isolation_group *isolation_group_new(const char *name)
+{
+ return NULL;
+}
+
+/*void isolation_group_dev_add(struct isolation_group *group,
+ struct device *dev)
+{
+}*/
+
+static inline void isolation_group_dev_remove(struct device *dev)
+{
+}
+
+static inline int isolation_group_update_links(struct device *dev)
+{
+ return 0;
+}
+
+static inline int isolation_is_detached(struct device *dev)
+{
+ return 0;
+}
+
+#endif
+
+#endif
--
1.7.7.3
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: RFC: Device isolation infrastructure 2011-12-07 3:58 RFC: Device isolation infrastructure David Gibson @ 2011-12-07 6:22 ` Benjamin Herrenschmidt 2011-12-07 6:46 ` Alex Williamson 1 sibling, 0 replies; 16+ messages in thread From: Benjamin Herrenschmidt @ 2011-12-07 6:22 UTC (permalink / raw) To: David Gibson Cc: Alex Williamson, joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248 On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote: > Alex Williamson recently posted some patches adding a new hook to > iommu_ops to identify groups of devices which cannot reliably be > distinguished by the iommu, and therefore can only safely be assigned > as a unit to a guest or userspace driver. Please ignore the numeros typos, David is typing all of that with a broken hand in a cast... Cheers, Ben. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 3:58 RFC: Device isolation infrastructure David Gibson 2011-12-07 6:22 ` Benjamin Herrenschmidt @ 2011-12-07 6:46 ` Alex Williamson 2011-12-07 14:22 ` David Gibson 1 sibling, 1 reply; 16+ messages in thread From: Alex Williamson @ 2011-12-07 6:46 UTC (permalink / raw) To: David Gibson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote: > > Alex Williamson recently posted some patches adding a new hook to > iommu_ops to identify groups of devices which cannot reliably be > distinguished by the iommu, and therefore can only safely be assigned > as a unit to a guest or userspace driver. > > I'm not very fond of that design, because: > - It requires something else to scan the group ids to present the > groups in a userspace discoverable manner, rather than handling that > in the core. The thought was that the vast majority of users won't make use of groups, so why bloat the core when the only user is vfio. We need bus specific drivers to expose devices in vfio, so it's fairly natural to have the bus drivers enumerate devices on their bus, register them with the vfio-core, which then exposes groups in a convenient manner for userspace. > - It encorages manipulating devices individually, but only > permitting certain actions if the whole group is together. This seems > confusing and arse-backwards to me. Hmm, this patch introduces a group level detach and reattach of devices, but seems to lack any support for actually making use of devices in the detached state, so it's rather hard to compare. How does someone wanting to interact with a device within the group do so? This appears to put devices entirely outside of the driver model when detached, which seems just as awkward. > - It does not provide an obvious means to prevent a kernel > driver unsafely binding to a new device which is hotplugged into an > iommu group already assigned to a guest. I've added this for PCI, triggered via the device add notifier. > - It is inherently centred arounf the iommu structure, however > there might be other platform dependent factors which also affect > which devices can be safely isolated from others. Yep, MSI isolation being an example. I tend to think the iommu driver is pretty closely tied to the platform (all current examples are), so the iommu driver should take those factors into account when advertising groups. > > These could probably each be addressed with various tweaks, but > basically, I'm just not convinced that it represents the right > structure for handling this. So, Alexey Kardashevskiy and myself havr > made this patches, representing the outline of how I think this should > be handled. > > > Notes: > * All names are negotiable. At the moment I'm thinking of this as > the "device isolation subsystem" which handles "device isolation > groups" (not to be confused with an iommu domain, which could contain > one or more groups). A group is considered "detached" when it has > been removed from the normal driver binding process, and therefore > ready for assignment to a guest or userspace. I'm more than open to > suggestions of better names. > * This is just the generic infrastructure. Obviously, platform > specific code will be needed as well to create the groups and assign > devices to them. We have a proof of concept implementation for the > power p5ioc2 pci host bridge, and I hope to post code for that and the > p7ioc bridge shortly. I'll also be looking at how this would be used > by Intel VTd, and AMD iommus. > * isolation groups appear in /sys/isolation. Isolatable devices > have a link to their group. The group contains links to all their > devices, and a "detached" attribute, which can be used to view and > control whether the group is detached. > * At present groups are detached manually via sysfs, but the idea > is a kernel system like vfio, could also directly request a group to > be detached for its use. > * when detached, all kernel drivers are unbound from every device > in the group. No driver will match any device in the group until it > is reattached. > > * There are probably stupid bugs, in particular I know some error > paths are broken. By all means point these out to expedite fixing > them. > * The isolation_group structure is more or less a stub at this > point. It will probably want a pointer to an iommu_ops and/or its own > ops structure. The 'detached' bool may need to become some sort of > state variable. We'll probably want some sort of "owner" field for > what (e.g. vfio) is using the group when its detached. This makes me nervous. One of the flaws I'm trying to correct with vfio vs existing kvm PCI device assignment is the driver ownership model. KVM will currently try to make use of any device it can and is only stopped by drivers having already claimed various resources. VFIO has a strong device-driver-group ownership model. Isolation groups facilitate ejecting the normal device drivers (which actually seems somewhat dangerous) but isn't yet introducing any clear ownership model to replace *a* device being bound to *a* driver. Devices seem to be in limbo once detached. We're not even outlining here how we're going to attach group-drivers to groups or facilitate matching individual devices to bus specific drivers, which seems like it'll get us back into the individual device manipulation you complain about above. > * I've also considered adding an "isolation strength" bitmask to > indicate how well isolated the group is - e.g DMA isolated (iommu), > irq isolated (irq remapping iommu, or other mechanism), error isolated > (bus errors from this device can't affect other devices), .. This sounds like a maintenance burden, which is why I've been trying to push that iommu drivers should not expose groups that aren't fully isolated. It's easy to add parameters per iommu driver where the user can opt-in to certain deficiencies. > Anyway, on with the code.. > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > --- > drivers/Kconfig | 2 + > drivers/Makefile | 2 + > drivers/base/base.h | 5 + > drivers/base/core.c | 7 + > drivers/base/init.c | 2 + > drivers/isolation/Kconfig | 3 + > drivers/isolation/Makefile | 2 + > drivers/isolation/device_isolation.c | 216 ++++++++++++++++++++++++++++++++++ > include/linux/device.h | 5 + > include/linux/device_isolation.h | 79 ++++++++++++ > 10 files changed, 323 insertions(+), 0 deletions(-) > create mode 100644 drivers/isolation/Kconfig > create mode 100644 drivers/isolation/Makefile > create mode 100644 drivers/isolation/device_isolation.c > create mode 100644 include/linux/device_isolation.h > > diff --git a/drivers/Kconfig b/drivers/Kconfig > index 95b9e7e..2ce5717 100644 > --- a/drivers/Kconfig > +++ b/drivers/Kconfig > @@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig" > > source "drivers/virt/Kconfig" > > +source "drivers/isolation/Kconfig" > + > endmenu > diff --git a/drivers/Makefile b/drivers/Makefile > index 7fa433a..741874a 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT) += iommu/ > > # Virtualization drivers > obj-$(CONFIG_VIRT_DRIVERS) += virt/ > + > +obj-$(CONFIG_DEVICE_ISOLATION) += isolation/ > diff --git a/drivers/base/base.h b/drivers/base/base.h > index a34dca0..66a4ed2 100644 > --- a/drivers/base/base.h > +++ b/drivers/base/base.h > @@ -24,6 +24,9 @@ > * bus_type/class to be statically allocated safely. Nothing outside of the > * driver core should ever touch these fields. > */ > + > +#include <linux/device_isolation.h> > + > struct subsys_private { > struct kset subsys; > struct kset *devices_kset; > @@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev); > static inline int driver_match_device(struct device_driver *drv, > struct device *dev) > { > + if (isolation_device_is_detached(dev)) > + return 0; > return drv->bus->match ? drv->bus->match(dev, drv) : 1; > } > > diff --git a/drivers/base/core.c b/drivers/base/core.c > index bc8729d..45201c5 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -22,6 +22,7 @@ > #include <linux/kallsyms.h> > #include <linux/mutex.h> > #include <linux/async.h> > +#include <linux/device_isolation.h> > > #include "base.h" > #include "power/power.h" > @@ -593,6 +594,10 @@ void device_initialize(struct device *dev) > lockdep_set_novalidate_class(&dev->mutex); > spin_lock_init(&dev->devres_lock); > INIT_LIST_HEAD(&dev->devres_head); > +#ifdef CONFIG_DEVICE_ISOLATION > + INIT_LIST_HEAD(&dev->isolation_list); > + dev->isolation_group = NULL; > +#endif > device_pm_init(dev); > set_dev_node(dev, -1); > } > @@ -993,6 +998,8 @@ int device_add(struct device *dev) > class_intf->add_dev(dev, class_intf); > mutex_unlock(&dev->class->p->class_mutex); > } At this point we've already attached a native driver to a potentially in-use isolation group, so we haven't really addressed the hotplug issue. If we're creating groups and adding devices to them via the device add notifier, that looks pretty similar to how vfio does it, and there we only need the trivial iommu_device_group hook. I'll give you that the "detached" hook in the match function makes it easy to take a device out of the driver model, but then what do you replace the driver model with? > + > + isolation_group_update_links(dev); > done: > put_device(dev); > return error; We'd need something in the device_del path to block removal if the device is in use via the group. The equivalent of the .remove for the driver it can no longer attach to. > diff --git a/drivers/base/init.c b/drivers/base/init.c > index c8a934e..89f3e94 100644 > --- a/drivers/base/init.c > +++ b/drivers/base/init.c > @@ -8,6 +8,7 @@ > #include <linux/device.h> > #include <linux/init.h> > #include <linux/memory.h> > +#include <linux/device_isolation.h> > > #include "base.h" > > @@ -24,6 +25,7 @@ void __init driver_init(void) > devices_init(); > buses_init(); > classes_init(); > + isolation_init(); > firmware_init(); > hypervisor_init(); > > diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig > new file mode 100644 > index 0000000..a2a7703 > --- /dev/null > +++ b/drivers/isolation/Kconfig > @@ -0,0 +1,3 @@ > +config DEVICE_ISOLATION > + bool "Enable isolating devices for safe pass-through to guests or user space." > + > diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile > new file mode 100644 > index 0000000..8f1c6ec > --- /dev/null > +++ b/drivers/isolation/Makefile > @@ -0,0 +1,2 @@ > +obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o > + > diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c > new file mode 100644 > index 0000000..2b00ff7 > --- /dev/null > +++ b/drivers/isolation/device_isolation.c > @@ -0,0 +1,216 @@ > +#include <linux/slab.h> > +#include <linux/device_isolation.h> > + > +#define KERN_DEBUG KERN_WARNING > + > + > +static struct kset *isolation_kset; > + > +#define to_isolation_attr(_attr) container_of(_attr, \ > + struct isolation_attribute, attr) > + > +static ssize_t isolation_attr_show(struct kobject *kobj, > + struct attribute *attr, char *buf) > +{ > + struct isolation_attribute *isolation_attr = to_isolation_attr(attr); > + struct isolation_group *group = container_of(kobj, > + struct isolation_group, kobj); > + ssize_t ret = -EIO; > + > + if (isolation_attr->show) > + ret = isolation_attr->show(group, buf); > + return ret; > +} > + > +static ssize_t isolation_attr_store(struct kobject *kobj, > + struct attribute *attr, const char *buf, size_t count) > +{ > + struct isolation_attribute *isolation_attr = to_isolation_attr(attr); > + struct isolation_group *group = container_of(kobj, > + struct isolation_group, kobj); > + ssize_t ret = -EIO; > + > + if (isolation_attr->store) > + ret = isolation_attr->store(group, buf, count); > + return ret; > +} > + > +static void isolation_release(struct kobject *kobj) > +{ > + struct isolation_group *group = container_of(kobj, > + struct isolation_group, kobj); > + > + pr_debug("isolation : release.\n"); > + > + kfree(group); > +} > + > +static const struct sysfs_ops isolation_sysfs_ops = { > + .show = isolation_attr_show, > + .store = isolation_attr_store, > +}; > + > +static struct kobj_type isolation_ktype = { > + .sysfs_ops = &isolation_sysfs_ops, > + .release = isolation_release, > +}; > + > +static ssize_t group_showdetached(struct isolation_group *group, > + char *buf) > +{ > + return sprintf(buf, "%u", !!group->detached); > +} > + > +static ssize_t group_setdetached(struct isolation_group *group, > + const char *buf, size_t count) > +{ > + switch (buf[0]) { > + case '0': > + isolation_group_reattach(group); > + break; > + case '1': > + isolation_group_detach(group); > + break; > + default: > + count = -EINVAL; > + } > + return count; > +} > + > +static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR, > + group_showdetached, group_setdetached); > + > +struct isolation_group *isolation_group_new(const char *name) > +{ > + int ret; > + struct isolation_group *group; > + > + group = kzalloc(sizeof(*group), GFP_KERNEL); > + if (!group) > + return NULL; > + > + spin_lock_init(&group->lock); > + > + group->kobj.kset = isolation_kset; > + ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name); > + if (ret < 0) { > + printk(KERN_ERR "device_isolation: kobject_init_and_add failed " > + "for %s\n", group->kobj.name); > + return NULL; > + } > + > + ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr); > + if (0 > ret) > + printk(KERN_WARNING "device_isolation: create \"detached\" failed " > + "for %s, errno=%i\n", group->kobj.name, ret); > + > + INIT_LIST_HEAD(&group->devices); > + printk(KERN_DEBUG "device_isolation: group %s created\n", > + group->kobj.name); > + > + return group; > +} > + > +void isolation_group_dev_add(struct isolation_group *group, struct device *dev) > +{ > + printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n", > + dev->kobj.name, group->kobj.name); > + spin_lock(&group->lock); > + > + list_add_tail(&dev->isolation_list, &group->devices); > + dev->isolation_group = group; > + /* Todo: remove links we already created */ > + spin_unlock(&group->lock); > +} > + > +void isolation_group_dev_remove(struct device *dev) > +{ > + list_del(&dev->isolation_list); > +} > + > +int isolation_group_update_links(struct device *dev) > +{ > + int error; > + char *buf; > + struct isolation_group *group = dev->isolation_group; > + > + if (!group) { > + printk(KERN_DEBUG "device_isolation: no group for %s\n", > + dev->kobj.name); > + return 0; > + } > + > + printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n", > + dev->kobj.name, group->kobj.name); > + > + spin_lock(&group->lock); > + > + error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group"); > + if (0 > error) > + printk(KERN_WARNING "device_isolation: create isolation_group " > + "link failed for %s -> %s, errno=%i\n", > + dev->kobj.name, group->kobj.name, error); > + > + buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev); > + if (!buf) { > + error = -ENOMEM; > + printk(KERN_ERR "device_isolation: kasprintf failed\n"); > + goto out; > + } > + > + error = sysfs_create_link(&group->kobj, &dev->kobj, buf); > + kfree(buf); > + if (0 > error) > + printk(KERN_WARNING "device_isolation: create %s " > + "link failed for %s -> %s, errno=%i\n", > + buf, dev->kobj.name, group->kobj.name, error); > + > +out: > + /* Todo: remove links we already created */ > + spin_unlock(&group->lock); > + return error; > +} > + > +int isolation_group_detach(struct isolation_group *group) > +{ > + int error; > + struct device *dev; > + > + spin_lock(&group->lock); > + > + group->detached = true; > + list_for_each_entry(dev, &group->devices, isolation_list) { > + error = device_reprobe(dev); How do we handle blocking .remove functions? That's one reason vfio doesn't attempt to automatically detach drivers. We can treat the groups as a unit here, but it's hard to get away from the fact that potentially several individual devices are being removed from their drivers here. > + } > + > + spin_unlock(&group->lock); > + > + return 0; > +} > + > +int isolation_group_reattach(struct isolation_group *group) > +{ > + int error; > + struct device *dev; > + > + spin_lock(&group->lock); Obviously some ownership check would have to exist here if we knew how to claim a group. > + group->detached = false; > + list_for_each_entry(dev, &group->devices, isolation_list) { > + printk("Reprobing %s\n", dev->kobj.name); > + error = device_reprobe(dev); > + } > + > + spin_unlock(&group->lock); > + > + return 0; > +} > + > +int __init isolation_init(void) > +{ > + isolation_kset = kset_create_and_add("isolation", NULL, NULL); > + if (!isolation_kset) > + return -ENOMEM; > + return 0; > +} > + > diff --git a/include/linux/device.h b/include/linux/device.h > index c20dfbf..6238a13 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -585,6 +585,11 @@ struct device { > > struct dma_coherent_mem *dma_mem; /* internal for coherent mem > override */ > +#ifdef CONFIG_DEVICE_ISOLATION > + struct isolation_group *isolation_group; > + struct list_head isolation_list; > +#endif > + > /* arch specific additions */ > struct dev_archdata archdata; > > diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h > new file mode 100644 > index 0000000..1dc7e68 > --- /dev/null > +++ b/include/linux/device_isolation.h > @@ -0,0 +1,79 @@ > +#ifndef _DEVICE_ISOLATION_H_ > +#define _DEVICE_ISOLATION_H_ > + > +#include <linux/kobject.h> > +#include <linux/list.h> > +#include <linux/device.h> > + > +struct isolation_group { > + struct kobject kobj; > + struct list_head devices; > + spinlock_t lock; > + bool detached; > +}; > + > +struct isolation_attribute { > + struct attribute attr; > + ssize_t (*show)(struct isolation_group *group, char *buf); > + ssize_t (*store)(struct isolation_group *group, const char *buf, > + size_t count); > +}; > + > +#ifdef CONFIG_DEVICE_ISOLATION > + > +int __init isolation_init(void); > + > +struct isolation_group *isolation_group_new(const char *name); > +int isolation_group_delete(struct isolation_group *group); > + > +void isolation_group_dev_add(struct isolation_group *group, > + struct device *dev); > +void isolation_group_dev_remove(struct device *dev); > + > +int isolation_group_update_links(struct device *dev); > +int isolation_group_detach(struct isolation_group *group); > +int isolation_group_reattach(struct isolation_group *group); > + > +#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store) \ > + struct isolation_attribute isolation_attr_##_name = \ > + __ATTR(_name, _mode, _show, _store) > + > +static inline int isolation_device_is_detached(struct device *dev) > +{ > + return dev->isolation_group && dev->isolation_group->detached; > +} > + > +#else > + > +static inline int __init isolation_init(void) > +{ > + return 0; > +} > + > +static inline struct isolation_group *isolation_group_new(const char *name) > +{ > + return NULL; > +} > + > +/*void isolation_group_dev_add(struct isolation_group *group, > + struct device *dev) > +{ > +}*/ > + > +static inline void isolation_group_dev_remove(struct device *dev) > +{ > +} > + > +static inline int isolation_group_update_links(struct device *dev) > +{ > + return 0; > +} > + > +static inline int isolation_is_detached(struct device *dev) > +{ > + return 0; > +} > + > +#endif > + > +#endif So what this boils down to so far are some group objects with lists of devices that allow us to pop off all the native drivers and prevent them from reattaching. And for that convenience, we're going to need a whole new group driver model to allow something like vfio to claim the group and then maybe some group-device model to allow bus specific drivers to easily make use of the devices within the group? And somehow along the way we're going to avoid the group vs device awkwardness? And again, all for an interface that the vast majority of users won't use. I'm not convinced. Thanks, Alex ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 6:46 ` Alex Williamson @ 2011-12-07 14:22 ` David Gibson 2011-12-07 19:45 ` Alex Williamson 0 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2011-12-07 14:22 UTC (permalink / raw) To: Alex Williamson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote: > On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote: > > > > Alex Williamson recently posted some patches adding a new hook to > > iommu_ops to identify groups of devices which cannot reliably be > > distinguished by the iommu, and therefore can only safely be assigned > > as a unit to a guest or userspace driver. > > > > I'm not very fond of that design, because: > > - It requires something else to scan the group ids to present the > > groups in a userspace discoverable manner, rather than handling that > > in the core. > > The thought was that the vast majority of users won't make use of > groups, so why bloat the core when the only user is vfio. Well, "won't make use of" actually means "are on a system with all-singleton groups". And the answer is because it actually doesn't bloat, but simplifies the core. We need the group concept _at minimum_ to handle bridges and buggy multifunction devices. So it makes life easier to assume they're always there, even if they're often singletons. > We need bus > specific drivers to expose devices in vfio, so it's fairly natural to > have the bus drivers enumerate devices on their bus, register them with > the vfio-core, which then exposes groups in a convenient manner for > userspace. > > > - It encorages manipulating devices individually, but only > > permitting certain actions if the whole group is together. This seems > > confusing and arse-backwards to me. > > Hmm, this patch introduces a group level detach and reattach of devices, > but seems to lack any support for actually making use of devices in the > detached state, so it's rather hard to compare. well, yes, that's the next layer up. Working on it.. > How does someone > wanting to interact with a device within the group do so? This appears > to put devices entirely outside of the driver model when detached, which > seems just as awkward. > > > - It does not provide an obvious means to prevent a kernel > > driver unsafely binding to a new device which is hotplugged into an > > iommu group already assigned to a guest. > > I've added this for PCI, triggered via the device add notifier. Hrm. Does the notifier run early enough to guarantee that a driver can't bind before it's complete? > > - It is inherently centred arounf the iommu structure, however > > there might be other platform dependent factors which also affect > > which devices can be safely isolated from others. > > Yep, MSI isolation being an example. I tend to think the iommu driver > is pretty closely tied to the platform (all current examples are), so > the iommu driver should take those factors into account when advertising > groups. > > > These could probably each be addressed with various tweaks, but > > basically, I'm just not convinced that it represents the right > > structure for handling this. So, Alexey Kardashevskiy and myself havr > > made this patches, representing the outline of how I think this should > > be handled. > > > > > > Notes: > > * All names are negotiable. At the moment I'm thinking of this as > > the "device isolation subsystem" which handles "device isolation > > groups" (not to be confused with an iommu domain, which could contain > > one or more groups). A group is considered "detached" when it has > > been removed from the normal driver binding process, and therefore > > ready for assignment to a guest or userspace. I'm more than open to > > suggestions of better names. > > * This is just the generic infrastructure. Obviously, platform > > specific code will be needed as well to create the groups and assign > > devices to them. We have a proof of concept implementation for the > > power p5ioc2 pci host bridge, and I hope to post code for that and the > > p7ioc bridge shortly. I'll also be looking at how this would be used > > by Intel VTd, and AMD iommus. > > * isolation groups appear in /sys/isolation. Isolatable devices > > have a link to their group. The group contains links to all their > > devices, and a "detached" attribute, which can be used to view and > > control whether the group is detached. > > * At present groups are detached manually via sysfs, but the idea > > is a kernel system like vfio, could also directly request a group to > > be detached for its use. > > * when detached, all kernel drivers are unbound from every device > > in the group. No driver will match any device in the group until it > > is reattached. > > > > * There are probably stupid bugs, in particular I know some error > > paths are broken. By all means point these out to expedite fixing > > them. > > * The isolation_group structure is more or less a stub at this > > point. It will probably want a pointer to an iommu_ops and/or its own > > ops structure. The 'detached' bool may need to become some sort of > > state variable. We'll probably want some sort of "owner" field for > > what (e.g. vfio) is using the group when its detached. > > This makes me nervous. One of the flaws I'm trying to correct with vfio > vs existing kvm PCI device assignment is the driver ownership model. > KVM will currently try to make use of any device it can and is only > stopped by drivers having already claimed various resources. VFIO has a > strong device-driver-group ownership model. Right... I'm not sure what aspect makes you nervous. Exactly what I'm saying here is that we will need an ownership model for detached groups. > Isolation groups facilitate ejecting the normal device drivers (which > actually seems somewhat dangerous) Why dangerous? You have to do exactly the same thing in more steps in order to use a group with vfio. > but isn't yet introducing any clear > ownership model to replace *a* device being bound to *a* driver. > Devices seem to be in limbo once detached. We're not even outlining > here how we're going to attach group-drivers to groups or facilitate > matching individual devices to bus specific drivers, which seems like > it'll get us back into the individual device manipulation you complain > about above. I'm envisaging something like struct isolation_group_owner *owner; in the struct i_g, which is NULL when the group is attached. It would be a parameter to the detach call, so a subsystem (like vfio) could detach and claim ownership atomically. The struct i_g_o would have some generic fields and some ops, and the owner code could container_of it to get private, more detailed info e.g. uid/gid/pid/mm* or whatever for vfio. > > * I've also considered adding an "isolation strength" bitmask to > > indicate how well isolated the group is - e.g DMA isolated (iommu), > > irq isolated (irq remapping iommu, or other mechanism), error isolated > > (bus errors from this device can't affect other devices), .. > > This sounds like a maintenance burden, which is why I've been trying to > push that iommu drivers should not expose groups that aren't fully > isolated. It's easy to add parameters per iommu driver where the user > can opt-in to certain deficiencies. True. In fact the only real way to use such a bitmask, I think, would be to have it as a global, defining the minimum isolation characteristics to the iommu drivers. That might help to unify configuration of what flaws in isolation are acceptable. > > Anyway, on with the code.. > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > > > --- > > drivers/Kconfig | 2 + > > drivers/Makefile | 2 + > > drivers/base/base.h | 5 + > > drivers/base/core.c | 7 + > > drivers/base/init.c | 2 + > > drivers/isolation/Kconfig | 3 + > > drivers/isolation/Makefile | 2 + > > drivers/isolation/device_isolation.c | 216 ++++++++++++++++++++++++++++++++++ > > include/linux/device.h | 5 + > > include/linux/device_isolation.h | 79 ++++++++++++ > > 10 files changed, 323 insertions(+), 0 deletions(-) > > create mode 100644 drivers/isolation/Kconfig > > create mode 100644 drivers/isolation/Makefile > > create mode 100644 drivers/isolation/device_isolation.c > > create mode 100644 include/linux/device_isolation.h > > > > diff --git a/drivers/Kconfig b/drivers/Kconfig > > index 95b9e7e..2ce5717 100644 > > --- a/drivers/Kconfig > > +++ b/drivers/Kconfig > > @@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig" > > > > source "drivers/virt/Kconfig" > > > > +source "drivers/isolation/Kconfig" > > + > > endmenu > > diff --git a/drivers/Makefile b/drivers/Makefile > > index 7fa433a..741874a 100644 > > --- a/drivers/Makefile > > +++ b/drivers/Makefile > > @@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT) += iommu/ > > > > # Virtualization drivers > > obj-$(CONFIG_VIRT_DRIVERS) += virt/ > > + > > +obj-$(CONFIG_DEVICE_ISOLATION) += isolation/ > > diff --git a/drivers/base/base.h b/drivers/base/base.h > > index a34dca0..66a4ed2 100644 > > --- a/drivers/base/base.h > > +++ b/drivers/base/base.h > > @@ -24,6 +24,9 @@ > > * bus_type/class to be statically allocated safely. Nothing outside of the > > * driver core should ever touch these fields. > > */ > > + > > +#include <linux/device_isolation.h> > > + > > struct subsys_private { > > struct kset subsys; > > struct kset *devices_kset; > > @@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev); > > static inline int driver_match_device(struct device_driver *drv, > > struct device *dev) > > { > > + if (isolation_device_is_detached(dev)) > > + return 0; > > return drv->bus->match ? drv->bus->match(dev, drv) : 1; > > } > > > > diff --git a/drivers/base/core.c b/drivers/base/core.c > > index bc8729d..45201c5 100644 > > --- a/drivers/base/core.c > > +++ b/drivers/base/core.c > > @@ -22,6 +22,7 @@ > > #include <linux/kallsyms.h> > > #include <linux/mutex.h> > > #include <linux/async.h> > > +#include <linux/device_isolation.h> > > > > #include "base.h" > > #include "power/power.h" > > @@ -593,6 +594,10 @@ void device_initialize(struct device *dev) > > lockdep_set_novalidate_class(&dev->mutex); > > spin_lock_init(&dev->devres_lock); > > INIT_LIST_HEAD(&dev->devres_head); > > +#ifdef CONFIG_DEVICE_ISOLATION > > + INIT_LIST_HEAD(&dev->isolation_list); > > + dev->isolation_group = NULL; > > +#endif > > device_pm_init(dev); > > set_dev_node(dev, -1); > > } > > @@ -993,6 +998,8 @@ int device_add(struct device *dev) > > class_intf->add_dev(dev, class_intf); > > mutex_unlock(&dev->class->p->class_mutex); > > } > > At this point we've already attached a native driver to a potentially > in-use isolation group, so we haven't really addressed the hotplug > issue. If we're creating groups and adding devices to them via the > device add notifier, that looks pretty similar to how vfio does it, and > there we only need the trivial iommu_device_group hook. Uh, no. This probably isn't clear without an example iommu implementation, but the idea is that group_dev_add is called before device_add, so group membership is established before the device appears in sysfs, and before a driver will bind. That works nicely for our iommu structure (built into the host bridge, essentially). It's a bit nastier on x86, because the iommus are initialized relatively late. I'm still getting my head around that code, but I think it will work, although it might need slightly different handling for the boot-time and hotplug cases. > I'll give you > that the "detached" hook in the match function makes it easy to take a > device out of the driver model, but then what do you replace the driver > model with? Other, vfio-like, subsystems explicitly claiming groups for themselves. > > + > > + isolation_group_update_links(dev); > > done: > > put_device(dev); > > return error; > > We'd need something in the device_del path to block removal if the > device is in use via the group. The equivalent of the .remove for the > driver it can no longer attach to. Good point. At least for the detach-via-sysfs case (which may not even be a good idea to keep). In general, I'd expect reattach() to be called by the subsystem which detached the group in the first place, when it's done with it. > > diff --git a/drivers/base/init.c b/drivers/base/init.c > > index c8a934e..89f3e94 100644 > > --- a/drivers/base/init.c > > +++ b/drivers/base/init.c > > @@ -8,6 +8,7 @@ > > #include <linux/device.h> > > #include <linux/init.h> > > #include <linux/memory.h> > > +#include <linux/device_isolation.h> > > > > #include "base.h" > > > > @@ -24,6 +25,7 @@ void __init driver_init(void) > > devices_init(); > > buses_init(); > > classes_init(); > > + isolation_init(); > > firmware_init(); > > hypervisor_init(); > > > > diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig > > new file mode 100644 > > index 0000000..a2a7703 > > --- /dev/null > > +++ b/drivers/isolation/Kconfig > > @@ -0,0 +1,3 @@ > > +config DEVICE_ISOLATION > > + bool "Enable isolating devices for safe pass-through to guests or user space." > > + > > diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile > > new file mode 100644 > > index 0000000..8f1c6ec > > --- /dev/null > > +++ b/drivers/isolation/Makefile > > @@ -0,0 +1,2 @@ > > +obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o > > + > > diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c > > new file mode 100644 > > index 0000000..2b00ff7 > > --- /dev/null > > +++ b/drivers/isolation/device_isolation.c > > @@ -0,0 +1,216 @@ > > +#include <linux/slab.h> > > +#include <linux/device_isolation.h> > > + > > +#define KERN_DEBUG KERN_WARNING > > + > > + > > +static struct kset *isolation_kset; > > + > > +#define to_isolation_attr(_attr) container_of(_attr, \ > > + struct isolation_attribute, attr) > > + > > +static ssize_t isolation_attr_show(struct kobject *kobj, > > + struct attribute *attr, char *buf) > > +{ > > + struct isolation_attribute *isolation_attr = to_isolation_attr(attr); > > + struct isolation_group *group = container_of(kobj, > > + struct isolation_group, kobj); > > + ssize_t ret = -EIO; > > + > > + if (isolation_attr->show) > > + ret = isolation_attr->show(group, buf); > > + return ret; > > +} > > + > > +static ssize_t isolation_attr_store(struct kobject *kobj, > > + struct attribute *attr, const char *buf, size_t count) > > +{ > > + struct isolation_attribute *isolation_attr = to_isolation_attr(attr); > > + struct isolation_group *group = container_of(kobj, > > + struct isolation_group, kobj); > > + ssize_t ret = -EIO; > > + > > + if (isolation_attr->store) > > + ret = isolation_attr->store(group, buf, count); > > + return ret; > > +} > > + > > +static void isolation_release(struct kobject *kobj) > > +{ > > + struct isolation_group *group = container_of(kobj, > > + struct isolation_group, kobj); > > + > > + pr_debug("isolation : release.\n"); > > + > > + kfree(group); > > +} > > + > > +static const struct sysfs_ops isolation_sysfs_ops = { > > + .show = isolation_attr_show, > > + .store = isolation_attr_store, > > +}; > > + > > +static struct kobj_type isolation_ktype = { > > + .sysfs_ops = &isolation_sysfs_ops, > > + .release = isolation_release, > > +}; > > + > > +static ssize_t group_showdetached(struct isolation_group *group, > > + char *buf) > > +{ > > + return sprintf(buf, "%u", !!group->detached); > > +} > > + > > +static ssize_t group_setdetached(struct isolation_group *group, > > + const char *buf, size_t count) > > +{ > > + switch (buf[0]) { > > + case '0': > > + isolation_group_reattach(group); > > + break; > > + case '1': > > + isolation_group_detach(group); > > + break; > > + default: > > + count = -EINVAL; > > + } > > + return count; > > +} > > + > > +static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR, > > + group_showdetached, group_setdetached); > > + > > +struct isolation_group *isolation_group_new(const char *name) > > +{ > > + int ret; > > + struct isolation_group *group; > > + > > + group = kzalloc(sizeof(*group), GFP_KERNEL); > > + if (!group) > > + return NULL; > > + > > + spin_lock_init(&group->lock); > > + > > + group->kobj.kset = isolation_kset; > > + ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name); > > + if (ret < 0) { > > + printk(KERN_ERR "device_isolation: kobject_init_and_add failed " > > + "for %s\n", group->kobj.name); > > + return NULL; > > + } > > + > > + ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr); > > + if (0 > ret) > > + printk(KERN_WARNING "device_isolation: create \"detached\" failed " > > + "for %s, errno=%i\n", group->kobj.name, ret); > > + > > + INIT_LIST_HEAD(&group->devices); > > + printk(KERN_DEBUG "device_isolation: group %s created\n", > > + group->kobj.name); > > + > > + return group; > > +} > > + > > +void isolation_group_dev_add(struct isolation_group *group, struct device *dev) > > +{ > > + printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n", > > + dev->kobj.name, group->kobj.name); > > + spin_lock(&group->lock); > > + > > + list_add_tail(&dev->isolation_list, &group->devices); > > + dev->isolation_group = group; > > + /* Todo: remove links we already created */ > > + spin_unlock(&group->lock); > > +} > > + > > +void isolation_group_dev_remove(struct device *dev) > > +{ > > + list_del(&dev->isolation_list); > > +} > > + > > +int isolation_group_update_links(struct device *dev) > > +{ > > + int error; > > + char *buf; > > + struct isolation_group *group = dev->isolation_group; > > + > > + if (!group) { > > + printk(KERN_DEBUG "device_isolation: no group for %s\n", > > + dev->kobj.name); > > + return 0; > > + } > > + > > + printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n", > > + dev->kobj.name, group->kobj.name); > > + > > + spin_lock(&group->lock); > > + > > + error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group"); > > + if (0 > error) > > + printk(KERN_WARNING "device_isolation: create isolation_group " > > + "link failed for %s -> %s, errno=%i\n", > > + dev->kobj.name, group->kobj.name, error); > > + > > + buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev); > > + if (!buf) { > > + error = -ENOMEM; > > + printk(KERN_ERR "device_isolation: kasprintf failed\n"); > > + goto out; > > + } > > + > > + error = sysfs_create_link(&group->kobj, &dev->kobj, buf); > > + kfree(buf); > > + if (0 > error) > > + printk(KERN_WARNING "device_isolation: create %s " > > + "link failed for %s -> %s, errno=%i\n", > > + buf, dev->kobj.name, group->kobj.name, error); > > + > > +out: > > + /* Todo: remove links we already created */ > > + spin_unlock(&group->lock); > > + return error; > > +} > > + > > +int isolation_group_detach(struct isolation_group *group) > > +{ > > + int error; > > + struct device *dev; > > + > > + spin_lock(&group->lock); > > + > > + group->detached = true; > > + list_for_each_entry(dev, &group->devices, isolation_list) { > > + error = device_reprobe(dev); > > How do we handle blocking .remove functions? That's one reason vfio > doesn't attempt to automatically detach drivers. We can treat the > groups as a unit here, but it's hard to get away from the fact that > potentially several individual devices are being removed from their > drivers here. Slowly. We have to remove all the drivers, so we have to just wait for them. I guess we should have a timeout and reattach the group if we can't unbind the drivers in that time. And/or figure a way to make it interruptible, so you can ctrl-c something midway through triggering a detach, and it will re-attach. Oh, right, and that means we can't take the lock around it all. Well that's fixable with the right rechecks of the list. we might need to replace the bool with some sort of state variable; ATTACHED -> DETACHING -> DETACHED. > > + } > > + > > + spin_unlock(&group->lock); > > + > > + return 0; > > +} > > + > > +int isolation_group_reattach(struct isolation_group *group) > > +{ > > + int error; > > + struct device *dev; > > + > > + spin_lock(&group->lock); > > Obviously some ownership check would have to exist here if we knew how > to claim a group. > > > + group->detached = false; > > + list_for_each_entry(dev, &group->devices, isolation_list) { > > + printk("Reprobing %s\n", dev->kobj.name); > > + error = device_reprobe(dev); > > + } > > + > > + spin_unlock(&group->lock); > > + > > + return 0; > > +} > > + > > +int __init isolation_init(void) > > +{ > > + isolation_kset = kset_create_and_add("isolation", NULL, NULL); > > + if (!isolation_kset) > > + return -ENOMEM; > > + return 0; > > +} > > + > > diff --git a/include/linux/device.h b/include/linux/device.h > > index c20dfbf..6238a13 100644 > > --- a/include/linux/device.h > > +++ b/include/linux/device.h > > @@ -585,6 +585,11 @@ struct device { > > > > struct dma_coherent_mem *dma_mem; /* internal for coherent mem > > override */ > > +#ifdef CONFIG_DEVICE_ISOLATION > > + struct isolation_group *isolation_group; > > + struct list_head isolation_list; > > +#endif > > + > > /* arch specific additions */ > > struct dev_archdata archdata; > > > > diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h > > new file mode 100644 > > index 0000000..1dc7e68 > > --- /dev/null > > +++ b/include/linux/device_isolation.h > > @@ -0,0 +1,79 @@ > > +#ifndef _DEVICE_ISOLATION_H_ > > +#define _DEVICE_ISOLATION_H_ > > + > > +#include <linux/kobject.h> > > +#include <linux/list.h> > > +#include <linux/device.h> > > + > > +struct isolation_group { > > + struct kobject kobj; > > + struct list_head devices; > > + spinlock_t lock; > > + bool detached; > > +}; > > + > > +struct isolation_attribute { > > + struct attribute attr; > > + ssize_t (*show)(struct isolation_group *group, char *buf); > > + ssize_t (*store)(struct isolation_group *group, const char *buf, > > + size_t count); > > +}; > > + > > +#ifdef CONFIG_DEVICE_ISOLATION > > + > > +int __init isolation_init(void); > > + > > +struct isolation_group *isolation_group_new(const char *name); > > +int isolation_group_delete(struct isolation_group *group); > > + > > +void isolation_group_dev_add(struct isolation_group *group, > > + struct device *dev); > > +void isolation_group_dev_remove(struct device *dev); > > + > > +int isolation_group_update_links(struct device *dev); > > +int isolation_group_detach(struct isolation_group *group); > > +int isolation_group_reattach(struct isolation_group *group); > > + > > +#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store) \ > > + struct isolation_attribute isolation_attr_##_name = \ > > + __ATTR(_name, _mode, _show, _store) > > + > > +static inline int isolation_device_is_detached(struct device *dev) > > +{ > > + return dev->isolation_group && dev->isolation_group->detached; > > +} > > + > > +#else > > + > > +static inline int __init isolation_init(void) > > +{ > > + return 0; > > +} > > + > > +static inline struct isolation_group *isolation_group_new(const char *name) > > +{ > > + return NULL; > > +} > > + > > +/*void isolation_group_dev_add(struct isolation_group *group, > > + struct device *dev) > > +{ > > +}*/ > > + > > +static inline void isolation_group_dev_remove(struct device *dev) > > +{ > > +} > > + > > +static inline int isolation_group_update_links(struct device *dev) > > +{ > > + return 0; > > +} > > + > > +static inline int isolation_is_detached(struct device *dev) > > +{ > > + return 0; > > +} > > + > > +#endif > > + > > +#endif > > So what this boils down to so far are some group objects with lists of > devices that allow us to pop off all the native drivers and prevent them > from reattaching. And for that convenience, we're going to need a whole > new group driver model to allow something like vfio to claim the group Well, we need the group model anyway. Having just ids at the iommu level merely pushes the work of tracking and exposing them up to higher levels. > and then maybe some group-device model to allow bus specific drivers to > easily make use of the devices within the group? Huh? > And somehow along the > way we're going to avoid the group vs device awkwardness? Groups are there whether we like it or not. This is just about exposing it cleanly. > And again, > all for an interface that the vast majority of users won't use. I'm not > convinced. Thanks, You don't get a choice about whether you use groups. If you have any two devices (or functions) that can't be isolated from each other, you have a group, and the kernel code (and the user code, for that matter) has to deal with it. So you have to be able to see the groups, or things will just mysteriously not work when you discover you had two not-mutually isolatable devices you hadn't realised you had. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 14:22 ` David Gibson @ 2011-12-07 19:45 ` Alex Williamson 2011-12-07 23:41 ` Benjamin Herrenschmidt 2011-12-08 2:43 ` David Gibson 0 siblings, 2 replies; 16+ messages in thread From: Alex Williamson @ 2011-12-07 19:45 UTC (permalink / raw) To: David Gibson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Thu, 2011-12-08 at 01:22 +1100, David Gibson wrote: > On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote: > > On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote: > > > > > > Alex Williamson recently posted some patches adding a new hook to > > > iommu_ops to identify groups of devices which cannot reliably be > > > distinguished by the iommu, and therefore can only safely be assigned > > > as a unit to a guest or userspace driver. > > > > > > I'm not very fond of that design, because: > > > - It requires something else to scan the group ids to present the > > > groups in a userspace discoverable manner, rather than handling that > > > in the core. > > > > The thought was that the vast majority of users won't make use of > > groups, so why bloat the core when the only user is vfio. > > Well, "won't make use of" actually means "are on a system with > all-singleton groups". And the answer is because it actually doesn't > bloat, but simplifies the core. We need the group concept _at > minimum_ to handle bridges and buggy multifunction devices. So it > makes life easier to assume they're always there, even if they're > often singletons. Ok, let's back up from the patch because I think its clouding the driver model with a detached device state that might be detracting from the useful bits. An isolation group is a generic association of devices into a set based on, well... isolation. Presumably something like an iommu driver does a bus_register_notifier() and bus_for_each_dev() and when a new device is added we do isolation_group_new() and isolation_group_dev_add() as necessary. When the iommu driver needs to do a dma_ops call for a device, it can traverse up to the isolation group and find something that points it to the shared iommu context for that group. While dma_ops can share the iommu context among the drivers attached to the various devices, iommu_ops wants a single meta-driver for the group because it's predominantly designed for use when devices are exposed directly to the user. We trust native kernel drivers to share context between each other, we do not trust user level drivers to share context with anyone but themselves. Ok so far? The first question we run into is what does the iommu driver do if it does not provide isolation? If it's a translation-only iommu, maybe it has a single context and can manage to maintain it internally and should opt-out of isolation groups. But we also have things like MSI isolation where we have multiple contexts and the isolation strength is sufficient for dma_ops, but not iommu_ops. I don't know if we want to make that determination via a strength bitmask at the group level or simply a bool that the iommu driver sets when it registers a group. Once we store iommu context at the group, it makes sense to start using it as the primary object the iommu drivers operate on. There's no need to break everyone for dma_ops, so isolation aware iommu drivers can simply follow the link from the device to the group with no change to callers. iommu_ops though can switch to being entirely group based [not looking forward to retrofitting groups into kvm device assignment]. So the next problem is that while the group is the minimum granularity for the iommu, it's not necessarily the desired granularity. iommus like VT-d have per PCI BDF context entries that can point to shared page tables. On such systems we also typically have singleton isolation groups, so when multiple devices are used by a single user, we have a lot of duplication in time and space. VFIO handles this by allowing groups to be "merged". When this happens, the merged groups point to the same iommu context. I'm not sure what the plan is with isolation groups, but we need some way to reduce that overhead. Finally, we get to how do we manage the devices of the group when we switch to iommu_ops mode vs dma_ops, which is what this patch seems to have started with. For VFIO we have bus specific drivers which attach to individual devices. When all the devices in the group are attached to the bus driver (by the user), iommu_ops, and therefore access to devices from userspace becomes available. This is the level that the vast majority of users are not going to make use of, so while the above does feel cleaner and justifies it's addition, we need to be careful here. My concern with the "detached state" approach is that we're removing devices from the existing driver model. Who does things like pm_runtime callbacks when we start using devices? How does a driver enumerate and attach to devices within the group? How do we even enumerate and attach to isolation groups for the group level metadriver? This is where I'm nervous we're just making a requirement for a group-driver model that lives alongside the existing device driver model. I actually like the approach that individual devices are bound by the user to vfio bus drivers (group/iommu_ops drivers). This is tedious, but easily automate-able in userspace and puts the responsibility of unbinding each device on the user. Imagine the chaos a user could cause if vfio creates a chardev for accessing a group, we give a user access to it, and each time it's opened all the group devices automatically get unbound from the host drivers and re-bound on close. 'while (true); do cat /dev/vfio/$GROUP; done' could cause a denial of service on the system. We just need to figure out how binding vfio bus drivers to all the isolation group devices turns the robot lions into Voltron ;) > > > - It does not provide an obvious means to prevent a kernel > > > driver unsafely binding to a new device which is hotplugged into an > > > iommu group already assigned to a guest. > > > > I've added this for PCI, triggered via the device add notifier. > > Hrm. Does the notifier run early enough to guarantee that a driver > can't bind before it's complete? Yes, like I assume you're planning to do with setting up groups, it runs from the device add bus notifier. Thanks, Alex ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 19:45 ` Alex Williamson @ 2011-12-07 23:41 ` Benjamin Herrenschmidt 2011-12-08 0:16 ` Chris Wright 2011-12-08 5:51 ` Alex Williamson 2011-12-08 2:43 ` David Gibson 1 sibling, 2 replies; 16+ messages in thread From: Benjamin Herrenschmidt @ 2011-12-07 23:41 UTC (permalink / raw) To: Alex Williamson Cc: David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248 On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote: > Ok, let's back up from the patch because I think its clouding the driver > model with a detached device state that might be detracting from the > useful bits. > > An isolation group is a generic association of devices into a set based > on, well... isolation. Presumably something like an iommu driver does a > bus_register_notifier() and bus_for_each_dev() and when a new device is > added we do isolation_group_new() and isolation_group_dev_add() as > necessary. When the iommu driver needs to do a dma_ops call for a > device, it can traverse up to the isolation group and find something > that points it to the shared iommu context for that group. While > dma_ops can share the iommu context among the drivers attached to the > various devices, iommu_ops wants a single meta-driver for the group > because it's predominantly designed for use when devices are exposed > directly to the user. We trust native kernel drivers to share context > between each other, we do not trust user level drivers to share context > with anyone but themselves. Ok so far? > > The first question we run into is what does the iommu driver do if it > does not provide isolation? If it's a translation-only iommu, maybe it > has a single context and can manage to maintain it internally and should > opt-out of isolation groups. But we also have things like MSI isolation > where we have multiple contexts and the isolation strength is sufficient > for dma_ops, but not iommu_ops. I don't know if we want to make that > determination via a strength bitmask at the group level or simply a bool > that the iommu driver sets when it registers a group. .../... I think we are trying to solve too many problems at once. I don't believe we should have the normal dma ops path in the kernel be rewritten on top of some new iommu isolation layer or anything like that at this stage. Proceed step by step or we'll never be finished. What that means imho is: - Existing dma/iommu arch layers mostly remain as-is. It has its own internal representation of devices, domains and possibly grouping information, and things like the buggy RICOH devices are handled there (via a flag set in a generic header quirk) - Isolation groups are a new high-level thing that is not used by the iommu/dma code for normal in-kernel operations at this stage. The platform iommu/dma code creates the isolation group objects and populates them, it doesn't initially -use- them. - We can then use that for the needs of exposing the groups to user space, perform driver level isolation, and provide the grouping information to vfio. - In the -long- run, individual iommu/dma layers might be modified to instead use the isolation layer as their own building block if that makes sense, this can be looked at on a case by case basis but is not a pre-req. That leaves us with what I think is the main disagreement between David and Alex approaches which is whether devices in a group are "detached" completely (in some kind of limbo) while they are used by vfio or something else vs. assigned to a different driver (vfio). After much thinking, I believe that the driver ownership model is the right one. However, we could still use the work David did here. IE. I _do_ think that exposing groups as the individual unit of ownership to user space is the right thing to do. That means that instead of that current detached flag alone, what we could do instead is have an in-kernel mechanism to request isolation which works by having the isolation group kernel code perform: - Rename the detached flag to "exclusive". This flag prevents automatic & user driven attachment of a driver by the generic code. Set that flag on all devices in the group (need a hook for hotplug). - Detach existing drivers. - Once everything is detached, attach the requested client driver (vfio being the one we initially provide, maybe uio could use that framework as well). This can be done using an explicit in-kernel attach API that can be used even when the exclusive flag is set. - Win - When releasing the group, we detach the client driver, clear the exclusive flag and trigger a re-probe (automatic matching of kernel drivers). Cheers, Ben. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 23:41 ` Benjamin Herrenschmidt @ 2011-12-08 0:16 ` Chris Wright 2011-12-08 1:32 ` Benjamin Herrenschmidt 2011-12-08 5:51 ` Alex Williamson 1 sibling, 1 reply; 16+ messages in thread From: Chris Wright @ 2011-12-08 0:16 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alex Williamson, David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248 * Benjamin Herrenschmidt (benh@kernel.crashing.org) wrote: > After much thinking, I believe that the driver ownership model is the > right one. However, we could still use the work David did here. IE. I > _do_ think that exposing groups as the individual unit of ownership to > user space is the right thing to do. OK, I think I agree at least w/ the ownership model... > That means that instead of that current detached flag alone, what we > could do instead is have an in-kernel mechanism to request isolation > which works by having the isolation group kernel code perform: > > - Rename the detached flag to "exclusive". This flag prevents automatic > & user driven attachment of a driver by the generic code. Set that flag > on all devices in the group (need a hook for hotplug). > > - Detach existing drivers. > > - Once everything is detached, attach the requested client driver (vfio > being the one we initially provide, maybe uio could use that framework > as well). This can be done using an explicit in-kernel attach API that > can be used even when the exclusive flag is set. Since users will want to drive devices, not groups, how would you enumerate and expose devices here if the user visible construct exposed is groups? thanks, -chris ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 0:16 ` Chris Wright @ 2011-12-08 1:32 ` Benjamin Herrenschmidt 2011-12-08 1:36 ` Chris Wright 0 siblings, 1 reply; 16+ messages in thread From: Benjamin Herrenschmidt @ 2011-12-08 1:32 UTC (permalink / raw) To: Chris Wright Cc: Alex Williamson, David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel, agraf, scottwood, B08248 On Wed, 2011-12-07 at 16:16 -0800, Chris Wright wrote: > Since users will want to drive devices, not groups, how would you > enumerate and expose devices here if the user visible construct > exposed is groups? We do expose the individual devices in the group, which can then be used for things like mmap'ing of BARs etc... However, at least, the ownership transition is done on a whole group basis Cheers, Ben. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 1:32 ` Benjamin Herrenschmidt @ 2011-12-08 1:36 ` Chris Wright 0 siblings, 0 replies; 16+ messages in thread From: Chris Wright @ 2011-12-08 1:36 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Chris Wright, Alex Williamson, David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel, agraf, scottwood, B08248 * Benjamin Herrenschmidt (benh@kernel.crashing.org) wrote: > On Wed, 2011-12-07 at 16:16 -0800, Chris Wright wrote: > > Since users will want to drive devices, not groups, how would you > > enumerate and expose devices here if the user visible construct > > exposed is groups? > > We do expose the individual devices in the group, which can then be used > for things like mmap'ing of BARs etc... Right, one of the areas that got sticky before, which is why I wondered how you'd manage it. > However, at least, the ownership transition is done on a whole group > basis Yeah, that makes sense. thanks, -chris ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 23:41 ` Benjamin Herrenschmidt 2011-12-08 0:16 ` Chris Wright @ 2011-12-08 5:51 ` Alex Williamson 2011-12-09 2:38 ` David Gibson 1 sibling, 1 reply; 16+ messages in thread From: Alex Williamson @ 2011-12-08 5:51 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David Gibson, joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248 On Thu, 2011-12-08 at 10:41 +1100, Benjamin Herrenschmidt wrote: > On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote: > > > Ok, let's back up from the patch because I think its clouding the driver > > model with a detached device state that might be detracting from the > > useful bits. > > > > An isolation group is a generic association of devices into a set based > > on, well... isolation. Presumably something like an iommu driver does a > > bus_register_notifier() and bus_for_each_dev() and when a new device is > > added we do isolation_group_new() and isolation_group_dev_add() as > > necessary. When the iommu driver needs to do a dma_ops call for a > > device, it can traverse up to the isolation group and find something > > that points it to the shared iommu context for that group. While > > dma_ops can share the iommu context among the drivers attached to the > > various devices, iommu_ops wants a single meta-driver for the group > > because it's predominantly designed for use when devices are exposed > > directly to the user. We trust native kernel drivers to share context > > between each other, we do not trust user level drivers to share context > > with anyone but themselves. Ok so far? > > > > The first question we run into is what does the iommu driver do if it > > does not provide isolation? If it's a translation-only iommu, maybe it > > has a single context and can manage to maintain it internally and should > > opt-out of isolation groups. But we also have things like MSI isolation > > where we have multiple contexts and the isolation strength is sufficient > > for dma_ops, but not iommu_ops. I don't know if we want to make that > > determination via a strength bitmask at the group level or simply a bool > > that the iommu driver sets when it registers a group. > > .../... > > I think we are trying to solve too many problems at once. > > I don't believe we should have the normal dma ops path in the kernel be > rewritten on top of some new iommu isolation layer or anything like that > at this stage. I headed down this path because David seems to be suggesting, well stating, that all this infrastructure is useful for providing simplicity and efficiency of groups to the core, and not just for this niche vfio usage. "You don't get a choice about whether you use groups." The only way I can rationalize that is to give it some purpose which is used by something other that just vfio. If all of that is "some day", then why don't we start with the group management in vfio as I have it today and eventually push that down into the core? > Proceed step by step or we'll never be finished. What that means imho > is: > > - Existing dma/iommu arch layers mostly remain as-is. It has its own > internal representation of devices, domains and possibly grouping > information, and things like the buggy RICOH devices are handled there > (via a flag set in a generic header quirk) But we're already being harassed to rework iommu_ops to operate based on groups and to take grouping into account for things like the ricoh device in dma_ops. Mixed messages... > - Isolation groups are a new high-level thing that is not used by the > iommu/dma code for normal in-kernel operations at this stage. The > platform iommu/dma code creates the isolation group objects and > populates them, it doesn't initially -use- them. The iommu_device_group() interface is a tiny extension to the IOMMU API that basically lets vfio query the iommu driver for a group identifier for a device. vfio then uses that information to create and manage groups itself. This is simply reversing that, allowing the iommu driver to create groups and add devices, which are centrally managed. That sounds like a good long term goal, especially when we start making use of the groups for dma/iommu ops, but it does require that we define an API both for group providers (iommu drivers) and the group users (vfio). It's also much more intrusive to the iommu drivers (note the trivial patches to add iommu_device_group support to intel and amd iommu) and creates bloat for a niche user if we're not going to get around to making groups more pervasive anytime soon. > - We can then use that for the needs of exposing the groups to user > space, perform driver level isolation, and provide the grouping > information to vfio. > > - In the -long- run, individual iommu/dma layers might be modified to > instead use the isolation layer as their own building block if that > makes sense, this can be looked at on a case by case basis but is not a > pre-req. > > That leaves us with what I think is the main disagreement between David > and Alex approaches which is whether devices in a group are "detached" > completely (in some kind of limbo) while they are used by vfio or > something else vs. assigned to a different driver (vfio). > > After much thinking, I believe that the driver ownership model is the > right one. However, we could still use the work David did here. IE. I > _do_ think that exposing groups as the individual unit of ownership to > user space is the right thing to do. So far we're only exposing the group structure here (which vfio also does in sysfs), not providing a userspace or driver interface for taking ownership of the group. I'm still not clear how this ends up satisfying David's complaints about the group ownership vs device access model in vfio. > That means that instead of that current detached flag alone, what we > could do instead is have an in-kernel mechanism to request isolation > which works by having the isolation group kernel code perform: > > - Rename the detached flag to "exclusive". This flag prevents automatic > & user driven attachment of a driver by the generic code. Set that flag > on all devices in the group (need a hook for hotplug). This is a sysfs attribute on the group? So step #1: $ echo 1 > /sys/isolation/$GROUP/exclusive > - Detach existing drivers. step #2: $ for i in $(ls /sys/isolation/$GROUP/devices/); do \ echo $(basename $i) > $i/driver/unbind; done > > - Once everything is detached, attach the requested client driver (vfio > being the one we initially provide, maybe uio could use that framework > as well). This can be done using an explicit in-kernel attach API that > can be used even when the exclusive flag is set. step #3: A miracle occurs ;) I'm not sure what this looks like. Are we going to do a isolation_group_register_meta_driver() that vfio calls with some defined ops structure, so a user can: $ echo "vfio" > /sys/isolation/$GROUP/driver That triggers some list walk of the group devices, asking vfio to find and force a driver attach to the bus driver for each device? A model something like that is the only way I can see that removing groups from vfio doesn't also disassociate the group meta driver from the group device drivers. If instead, we make vfio be the group manager (since nobody else uses it anyway), all of the above becomes: $ for i in $(ls /sys/devices/virtual/vfio/pci:257/devices); do \ echo $(basename $i) > $i/driver/unbind; \ echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind; \ done > - Win I wish, we still have to make it through how vfio then exposes the group to userspace and how we don't end up with the same individual device manipulation vs group access model that seems to be getting under David's skin. Effectively what we've done above is: * provide a group model that we'll "some day" make more pervasive * make the binding of devices to group device drivers implicit in some kind of meta driver attachment Thanks, Alex ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 5:51 ` Alex Williamson @ 2011-12-09 2:38 ` David Gibson 0 siblings, 0 replies; 16+ messages in thread From: David Gibson @ 2011-12-09 2:38 UTC (permalink / raw) To: Alex Williamson Cc: Benjamin Herrenschmidt, joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248 On Wed, Dec 07, 2011 at 10:51:59PM -0700, Alex Williamson wrote: > On Thu, 2011-12-08 at 10:41 +1100, Benjamin Herrenschmidt wrote: > > On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote: > > > > > Ok, let's back up from the patch because I think its clouding the driver > > > model with a detached device state that might be detracting from the > > > useful bits. > > > > > > An isolation group is a generic association of devices into a set based > > > on, well... isolation. Presumably something like an iommu driver does a > > > bus_register_notifier() and bus_for_each_dev() and when a new device is > > > added we do isolation_group_new() and isolation_group_dev_add() as > > > necessary. When the iommu driver needs to do a dma_ops call for a > > > device, it can traverse up to the isolation group and find something > > > that points it to the shared iommu context for that group. While > > > dma_ops can share the iommu context among the drivers attached to the > > > various devices, iommu_ops wants a single meta-driver for the group > > > because it's predominantly designed for use when devices are exposed > > > directly to the user. We trust native kernel drivers to share context > > > between each other, we do not trust user level drivers to share context > > > with anyone but themselves. Ok so far? > > > > > > The first question we run into is what does the iommu driver do if it > > > does not provide isolation? If it's a translation-only iommu, maybe it > > > has a single context and can manage to maintain it internally and should > > > opt-out of isolation groups. But we also have things like MSI isolation > > > where we have multiple contexts and the isolation strength is sufficient > > > for dma_ops, but not iommu_ops. I don't know if we want to make that > > > determination via a strength bitmask at the group level or simply a bool > > > that the iommu driver sets when it registers a group. > > > > .../... > > > > I think we are trying to solve too many problems at once. > > > > I don't believe we should have the normal dma ops path in the kernel be > > rewritten on top of some new iommu isolation layer or anything like that > > at this stage. > > I headed down this path because David seems to be suggesting, well > stating, that all this infrastructure is useful for providing simplicity > and efficiency of groups to the core, and not just for this niche vfio > usage. "You don't get a choice about whether you use groups." The only > way I can rationalize that is to give it some purpose which is used by > something other that just vfio. If all of that is "some day", then why > don't we start with the group management in vfio as I have it today and > eventually push that down into the core? So.. you misunderstand what I meant by that, but I just realised that's because I misunderstood what I said that in response to. My point was that if you're using vfio (or something like it), then the choice of whether groups are relevant is not the user's, but depends on the hardware setup. But I now realise that your point was that very few people will use vfio itself, or anything like it. And therefore you're uncomfortable with the extent to which my patch impinges on the core driver model. Which is a good point. I still think this sufficiently solidifies the isolation model to be worth it. > > Proceed step by step or we'll never be finished. What that means imho > > is: > > > > - Existing dma/iommu arch layers mostly remain as-is. It has its own > > internal representation of devices, domains and possibly grouping > > information, and things like the buggy RICOH devices are handled there > > (via a flag set in a generic header quirk) > > But we're already being harassed to rework iommu_ops to operate based on > groups and to take grouping into account for things like the ricoh > device in dma_ops. Mixed messages... Well, ok, maybe not so far in the future, but in any case this can be considered independently. > > - Isolation groups are a new high-level thing that is not used by the > > iommu/dma code for normal in-kernel operations at this stage. The > > platform iommu/dma code creates the isolation group objects and > > populates them, it doesn't initially -use- them. > > The iommu_device_group() interface is a tiny extension to the IOMMU API > that basically lets vfio query the iommu driver for a group identifier > for a device. vfio then uses that information to create and manage > groups itself. > > This is simply reversing that, allowing the iommu driver to create > groups and add devices, which are centrally managed. That sounds like a > good long term goal, especially when we start making use of the groups > for dma/iommu ops, but it does require that we define an API both for > group providers (iommu drivers) and the group users (vfio). It's also > much more intrusive to the iommu drivers (note the trivial patches to > add iommu_device_group support to intel and amd iommu) and creates bloat > for a niche user if we're not going to get around to making groups more > pervasive anytime soon. > > > - We can then use that for the needs of exposing the groups to user > > space, perform driver level isolation, and provide the grouping > > information to vfio. > > > > - In the -long- run, individual iommu/dma layers might be modified to > > instead use the isolation layer as their own building block if that > > makes sense, this can be looked at on a case by case basis but is not a > > pre-req. > > > > That leaves us with what I think is the main disagreement between David > > and Alex approaches which is whether devices in a group are "detached" > > completely (in some kind of limbo) while they are used by vfio or > > something else vs. assigned to a different driver (vfio). > > > > After much thinking, I believe that the driver ownership model is the > > right one. However, we could still use the work David did here. IE. I > > _do_ think that exposing groups as the individual unit of ownership to > > user space is the right thing to do. > > So far we're only exposing the group structure here (which vfio also > does in sysfs), not providing a userspace or driver interface for taking > ownership of the group. I'm still not clear how this ends up satisfying > David's complaints about the group ownership vs device access model in > vfio. > > > That means that instead of that current detached flag alone, what we > > could do instead is have an in-kernel mechanism to request isolation > > which works by having the isolation group kernel code perform: > > > > - Rename the detached flag to "exclusive". This flag prevents automatic > > & user driven attachment of a driver by the generic code. Set that flag > > on all devices in the group (need a hook for hotplug). > > This is a sysfs attribute on the group? So step #1: > > $ echo 1 > /sys/isolation/$GROUP/exclusive > > > - Detach existing drivers. > > step #2: > > $ for i in $(ls /sys/isolation/$GROUP/devices/); do \ > echo $(basename $i) > $i/driver/unbind; done Right, but this version can be messed up if there's a hotplug after this but before vfio binds to the group. > > - Once everything is detached, attach the requested client driver (vfio > > being the one we initially provide, maybe uio could use that framework > > as well). This can be done using an explicit in-kernel attach API that > > can be used even when the exclusive flag is set. > > step #3: A miracle occurs ;) > > I'm not sure what this looks like. Are we going to do a > isolation_group_register_meta_driver() that vfio calls with some defined > ops structure, so a user can: > > $ echo "vfio" > /sys/isolation/$GROUP/driver > > That triggers some list walk of the group devices, asking vfio to find > and force a driver attach to the bus driver for each device? A model > something like that is the only way I can see that removing groups from > vfio doesn't also disassociate the group meta driver from the group > device drivers. That's one way we could do it. Another option is that the user explicitly just detaches the group, which puts it in limbo. The asking to add the group from the vfio interface triggers attaching the vfio "meta-driver" to the group (via a call to something like isolation_group_claim(group, &vfio_isolation_user)). I'm not sure what's best, I'll try to work something up and see how it looks in the next spin of my code. > If instead, we make vfio be the group manager (since nobody else uses it > anyway), all of the above becomes: > > $ for i in $(ls /sys/devices/virtual/vfio/pci:257/devices); do \ > echo $(basename $i) > $i/driver/unbind; \ > echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind; \ > done > > > - Win > > I wish, we still have to make it through how vfio then exposes the group > to userspace and how we don't end up with the same individual device > manipulation vs group access model that seems to be getting under > David's skin. I don't have a problem with individual device manipulation for things that pertain to an individual device - getting the resources, doing mmio, attaching irqs. I just want thing pertaining to a group's ownership change to be explicitly group oriented. > Effectively what we've done above is: > > * provide a group model that we'll "some day" make more pervasive > * make the binding of devices to group device drivers implicit in > some kind of meta driver attachment That's not a bad summary. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-07 19:45 ` Alex Williamson 2011-12-07 23:41 ` Benjamin Herrenschmidt @ 2011-12-08 2:43 ` David Gibson 2011-12-08 6:23 ` Alex Williamson 1 sibling, 1 reply; 16+ messages in thread From: David Gibson @ 2011-12-08 2:43 UTC (permalink / raw) To: Alex Williamson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote: > On Thu, 2011-12-08 at 01:22 +1100, David Gibson wrote: > > On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote: > > > On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote: > > > > > > > > Alex Williamson recently posted some patches adding a new hook to > > > > iommu_ops to identify groups of devices which cannot reliably be > > > > distinguished by the iommu, and therefore can only safely be assigned > > > > as a unit to a guest or userspace driver. > > > > > > > > I'm not very fond of that design, because: > > > > - It requires something else to scan the group ids to present the > > > > groups in a userspace discoverable manner, rather than handling that > > > > in the core. > > > > > > The thought was that the vast majority of users won't make use of > > > groups, so why bloat the core when the only user is vfio. > > > > Well, "won't make use of" actually means "are on a system with > > all-singleton groups". And the answer is because it actually doesn't > > bloat, but simplifies the core. We need the group concept _at > > minimum_ to handle bridges and buggy multifunction devices. So it > > makes life easier to assume they're always there, even if they're > > often singletons. > > Ok, let's back up from the patch because I think its clouding the driver > model with a detached device state that might be detracting from the > useful bits. Ok. I'll start out by saying that I think there's some confusion in the below because you're tending to identify an isolation group with an iommu domain, which is not my intention. An isolation group is the minimum granularity of an iommu domain, of course, but it's absolutely my intention that you could build an iommu domain with mutiple groups. A group is a unit that *can* be isolated, not a unit which *must* be isolated. > An isolation group is a generic association of devices into a set based > on, well... isolation. Presumably something like an iommu driver does a > bus_register_notifier() and bus_for_each_dev() and when a new device is > added we do isolation_group_new() and isolation_group_dev_add() as > necessary. Uh, roughly speaking, yes. > When the iommu driver needs to do a dma_ops call for a > device, it can traverse up to the isolation group and find something > that points it to the shared iommu context for that group. While > dma_ops can share the iommu context among the drivers attached to the > various devices, iommu_ops wants a single meta-driver for the group > because it's predominantly designed for use when devices are exposed > directly to the user. We trust native kernel drivers to share context > between each other, we do not trust user level drivers to share context > with anyone but themselves. Ok so far? Not so much. As Ben says, getting the existing kernel dma_ops and so forth to use the isolation structure is very much a "maybe one day" thing. > The first question we run into is what does the iommu driver do if it > does not provide isolation? If it's a translation-only iommu, maybe it > has a single context and can manage to maintain it internally and should > opt-out of isolation groups. But we also have things like MSI isolation > where we have multiple contexts and the isolation strength is sufficient > for dma_ops, but not iommu_ops. I don't know if we want to make that > determination via a strength bitmask at the group level or simply a bool > that the iommu driver sets when it registers a group. If the device can't be safely isolated, then it should have no isolation group. The definition of "safely" might vary depending on platform and/or kernel parameters. > Once we store iommu context at the group, it makes sense to start using > it as the primary object the iommu drivers operate on. There's no need > to break everyone for dma_ops, so isolation aware iommu drivers can > simply follow the link from the device to the group with no change to > callers. iommu_ops though can switch to being entirely group based [not > looking forward to retrofitting groups into kvm device assignment]. Again, maybe one day. We don't need to attack that in order to handle the vfio problem. > So the next problem is that while the group is the minimum granularity > for the iommu, it's not necessarily the desired granularity. iommus > like VT-d have per PCI BDF context entries that can point to shared page > tables. On such systems we also typically have singleton isolation > groups, so when multiple devices are used by a single user, we have a > lot of duplication in time and space. VFIO handles this by allowing > groups to be "merged". When this happens, the merged groups point to > the same iommu context. I'm not sure what the plan is with isolation > groups, but we need some way to reduce that overhead. Right. So, again, I intend that mutiple groups can go into one domain. Not entirely sure of the interface yet. One I had in mind was to borrow the vfio1 interface, so you open a /dev/vfio (each open gives a new instance). Then you do an "addgroup" ioctl which adds a group to the domain. You can do that multiple times, then start using the domain. > Finally, we get to how do we manage the devices of the group when we > switch to iommu_ops mode vs dma_ops, which is what this patch seems to > have started with. For VFIO we have bus specific drivers which attach > to individual devices. When all the devices in the group are attached > to the bus driver (by the user), iommu_ops, and therefore access to > devices from userspace becomes available. > > This is the level that the vast majority of users are not going to make > use of, so while the above does feel cleaner and justifies it's > addition, we need to be careful here. > > My concern with the "detached state" approach is that we're removing > devices from the existing driver model. Who does things like pm_runtime > callbacks when we start using devices? That's a point. Ben suggests an approach that I'll comment on there. > How does a driver enumerate and > attach to devices within the group? Well, detaching prevents driver binding. All the struct devices are still there and can be examined as usual. How the group's owner manages access to individual devices within the group is up to it. > How do we even enumerate and attach > to isolation groups for the group level metadriver? This is where I'm > nervous we're just making a requirement for a group-driver model that > lives alongside the existing device driver model. Well, sort of. But the "group driver model" is vastly simpler than the full driver model, because the assumption is that binding a group owner to a group is *always* the result of an explicit user request specifying both owner and group. There's no matching, no drivers of different bus types or the other things that make the normal driver model complicated. > I actually like the approach that individual devices are bound by the > user to vfio bus drivers (group/iommu_ops drivers). This is tedious, > but easily automate-able in userspace and puts the responsibility of > unbinding each device on the user. Imagine the chaos a user could cause > if vfio creates a chardev for accessing a group, we give a user access > to it, and each time it's opened all the group devices automatically get > unbound from the host drivers and re-bound on close. 'while (true); do > cat /dev/vfio/$GROUP; done' could cause a denial of service on the > system. Well, no, because regardless of the interface, detaching the devices would need equivalent permission. > We just need to figure out how binding vfio bus drivers to all the > isolation group devices turns the robot lions into Voltron ;) > > > > > - It does not provide an obvious means to prevent a kernel > > > > driver unsafely binding to a new device which is hotplugged into an > > > > iommu group already assigned to a guest. > > > > > > I've added this for PCI, triggered via the device add notifier. > > > > Hrm. Does the notifier run early enough to guarantee that a driver > > can't bind before it's complete? > > Yes, like I assume you're planning to do with setting up groups, it runs > from the device add bus notifier. Thanks, > > Alex > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 2:43 ` David Gibson @ 2011-12-08 6:23 ` Alex Williamson 2011-12-08 6:52 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: Alex Williamson @ 2011-12-08 6:23 UTC (permalink / raw) To: David Gibson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote: > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote: > > So the next problem is that while the group is the minimum granularity > > for the iommu, it's not necessarily the desired granularity. iommus > > like VT-d have per PCI BDF context entries that can point to shared page > > tables. On such systems we also typically have singleton isolation > > groups, so when multiple devices are used by a single user, we have a > > lot of duplication in time and space. VFIO handles this by allowing > > groups to be "merged". When this happens, the merged groups point to > > the same iommu context. I'm not sure what the plan is with isolation > > groups, but we need some way to reduce that overhead. > > Right. So, again, I intend that mutiple groups can go into one > domain. Not entirely sure of the interface yet. One I had in mind > was to borrow the vfio1 interface, so you open a /dev/vfio (each open > gives a new instance). Then you do an "addgroup" ioctl which adds a > group to the domain. You can do that multiple times, then start using > the domain. This also revisits one of the primary problems of vfio1, the dependency on a privileged uiommu domain creation interface. Assigning a user ownership of a group should be a privileged operation. If a privileged user needs to open /dev/vfio, add groups, then drop privileges and hand the open file descriptor to an unprivileged user, the interface becomes much harder to use. "Hot merging" becomes impossible. Alex ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 6:23 ` Alex Williamson @ 2011-12-08 6:52 ` David Gibson 2011-12-08 14:28 ` Alex Williamson 0 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2011-12-08 6:52 UTC (permalink / raw) To: Alex Williamson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Wed, Dec 07, 2011 at 11:23:10PM -0700, Alex Williamson wrote: > On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote: > > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote: > > > So the next problem is that while the group is the minimum granularity > > > for the iommu, it's not necessarily the desired granularity. iommus > > > like VT-d have per PCI BDF context entries that can point to shared page > > > tables. On such systems we also typically have singleton isolation > > > groups, so when multiple devices are used by a single user, we have a > > > lot of duplication in time and space. VFIO handles this by allowing > > > groups to be "merged". When this happens, the merged groups point to > > > the same iommu context. I'm not sure what the plan is with isolation > > > groups, but we need some way to reduce that overhead. > > > > Right. So, again, I intend that mutiple groups can go into one > > domain. Not entirely sure of the interface yet. One I had in mind > > was to borrow the vfio1 interface, so you open a /dev/vfio (each open > > gives a new instance). Then you do an "addgroup" ioctl which adds a > > group to the domain. You can do that multiple times, then start using > > the domain. > > This also revisits one of the primary problems of vfio1, the dependency > on a privileged uiommu domain creation interface. Assigning a user > ownership of a group should be a privileged operation. If a privileged > user needs to open /dev/vfio, add groups, then drop privileges and hand > the open file descriptor to an unprivileged user, the interface becomes > much harder to use. "Hot merging" becomes impossible. No, I was assuming that "permission to detach" could be handed out to a user before this step. uid/gid/mode attributes in sysfs would suffice, though there might be better ways. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 6:52 ` David Gibson @ 2011-12-08 14:28 ` Alex Williamson 2011-12-09 3:00 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: Alex Williamson @ 2011-12-08 14:28 UTC (permalink / raw) To: David Gibson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Thu, 2011-12-08 at 17:52 +1100, David Gibson wrote: > On Wed, Dec 07, 2011 at 11:23:10PM -0700, Alex Williamson wrote: > > On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote: > > > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote: > > > > So the next problem is that while the group is the minimum granularity > > > > for the iommu, it's not necessarily the desired granularity. iommus > > > > like VT-d have per PCI BDF context entries that can point to shared page > > > > tables. On such systems we also typically have singleton isolation > > > > groups, so when multiple devices are used by a single user, we have a > > > > lot of duplication in time and space. VFIO handles this by allowing > > > > groups to be "merged". When this happens, the merged groups point to > > > > the same iommu context. I'm not sure what the plan is with isolation > > > > groups, but we need some way to reduce that overhead. > > > > > > Right. So, again, I intend that mutiple groups can go into one > > > domain. Not entirely sure of the interface yet. One I had in mind > > > was to borrow the vfio1 interface, so you open a /dev/vfio (each open > > > gives a new instance). Then you do an "addgroup" ioctl which adds a > > > group to the domain. You can do that multiple times, then start using > > > the domain. > > > > This also revisits one of the primary problems of vfio1, the dependency > > on a privileged uiommu domain creation interface. Assigning a user > > ownership of a group should be a privileged operation. If a privileged > > user needs to open /dev/vfio, add groups, then drop privileges and hand > > the open file descriptor to an unprivileged user, the interface becomes > > much harder to use. "Hot merging" becomes impossible. > > No, I was assuming that "permission to detach" could be handed out to > a user before this step. uid/gid/mode attributes in sysfs would > suffice, though there might be better ways. Wow. So you effectively get: # chown user:user /sys/isolation/$GROUP $ fd = open(/dev/vfio) $ ioctl(fd, addgroup, $GROUP) ^^^^ this is where the actual detach occurs? And vfio has to find the $GROUP object, figure out where /sys is mounted on this system, construct a path to $SYS/isolation/$GROUP, and check the file access before calling the isolation group detach API. Then later... $ ioctl(fd, delgroup, $GROUP) But a reattach can't occur here or we've just given the user the power to cause the DoS, or at least system chaos, I described earlier. So then the admin later has to: # chown root:root /sys/isolation/$GROUP # echo 0 > /sys/ioslation/$GROUP/detached Under the covers, "addgroup" would require vfio to a) "detach" the group, b) put the devices in an iommu domain. If we're dealing with multiple group, b) can be rejected by the iommu driver, at which point vfio then sends a "reattach" to the isolation API (but that can't actually do much, as noted above), the user then opens a new instance of /dev/vfio and tries an "addgroup" there. Also note that there was a request to make it possible to get multiple file descriptors to the same group/device/iommu from different threads. I don't see how that's possible here. And this is somehow better than the perceived dis-symmetry that annoys you in vfio? I'm getting a strong not-invented-here vibe and I'm not sure you've internalized all the interactions that make vfio manage groups the way it does. Alex ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: RFC: Device isolation infrastructure 2011-12-08 14:28 ` Alex Williamson @ 2011-12-09 3:00 ` David Gibson 0 siblings, 0 replies; 16+ messages in thread From: David Gibson @ 2011-12-09 3:00 UTC (permalink / raw) To: Alex Williamson Cc: joerg.roedel, dwmw2, iommu, aik, linux-kernel, chrisw, agraf, scottwood, B08248, benh On Thu, Dec 08, 2011 at 07:28:25AM -0700, Alex Williamson wrote: > On Thu, 2011-12-08 at 17:52 +1100, David Gibson wrote: > > On Wed, Dec 07, 2011 at 11:23:10PM -0700, Alex Williamson wrote: > > > On Thu, 2011-12-08 at 13:43 +1100, David Gibson wrote: > > > > On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote: > > > > > So the next problem is that while the group is the minimum granularity > > > > > for the iommu, it's not necessarily the desired granularity. iommus > > > > > like VT-d have per PCI BDF context entries that can point to shared page > > > > > tables. On such systems we also typically have singleton isolation > > > > > groups, so when multiple devices are used by a single user, we have a > > > > > lot of duplication in time and space. VFIO handles this by allowing > > > > > groups to be "merged". When this happens, the merged groups point to > > > > > the same iommu context. I'm not sure what the plan is with isolation > > > > > groups, but we need some way to reduce that overhead. > > > > > > > > Right. So, again, I intend that mutiple groups can go into one > > > > domain. Not entirely sure of the interface yet. One I had in mind > > > > was to borrow the vfio1 interface, so you open a /dev/vfio (each open > > > > gives a new instance). Then you do an "addgroup" ioctl which adds a > > > > group to the domain. You can do that multiple times, then start using > > > > the domain. > > > > > > This also revisits one of the primary problems of vfio1, the dependency > > > on a privileged uiommu domain creation interface. Assigning a user > > > ownership of a group should be a privileged operation. If a privileged > > > user needs to open /dev/vfio, add groups, then drop privileges and hand > > > the open file descriptor to an unprivileged user, the interface becomes > > > much harder to use. "Hot merging" becomes impossible. > > > > No, I was assuming that "permission to detach" could be handed out to > > a user before this step. uid/gid/mode attributes in sysfs would > > suffice, though there might be better ways. > > Wow. So you effectively get: > > # chown user:user /sys/isolation/$GROUP I was thinking more /sys/isolation/$GROUP/{uid,gid,mode} properties, because I'm not sure you can implement chown simply in sysfs, but, whatever. > $ fd = open(/dev/vfio) > $ ioctl(fd, addgroup, $GROUP) > ^^^^ this is where the actual detach occurs? > > And vfio has to find the $GROUP object, That's easy. > figure out where /sys is mounted > on this system, a) /sys is always at /sys (see Documentation/sysfs-tules.txt b) we don't need that anyway. vfio can use use an in-kernel interface to grab a handle on the group, it doesn't need to trawl through the filesystem. > construct a path to $SYS/isolation/$GROUP, and check the > file access before calling the isolation group detach API. Well, no, because the perms would just be in struct isolation_group, and we'd have a helper function that detaches iff current has the relevant permission. > Then later... > > $ ioctl(fd, delgroup, $GROUP) > > But a reattach can't occur here or we've just given the user the power > to cause the DoS, or at least system chaos, I described earlier. So > then the admin later has to: I think you overestimate this "chaos". But this is why I'm considering an alternate approach where the group is detached, then "meta-bound" as separate steps. > > # chown root:root /sys/isolation/$GROUP > # echo 0 > /sys/ioslation/$GROUP/detached Well, if you want to give it back to normal drivers. More usually, I expect you'd detach the group at boot, and when the first guest using it dies, it sits in limbo until the next guest (or more likely the same guest, restarted) claims it. > Under the covers, "addgroup" would require vfio to a) "detach" the > group, One function call. > b) put the devices in an iommu domain. If we're dealing with > multiple group, Well, it has to do that anyway. > b) can be rejected by the iommu driver, at which point > vfio then sends a "reattach" to the isolation API (but that can't > actually do much, as noted above), Yes. Again, one function call. > the user then opens a new instance > of /dev/vfio and tries an "addgroup" there. > > Also note that there was a request to make it possible to get multiple > file descriptors to the same group/device/iommu from different threads. > I don't see how that's possible here. dup(2)? Although it depends exactly what the alternate handles are needed for. I can't actually see why you would need them at all, at present. > And this is somehow better than the perceived dis-symmetry that annoys > you in vfio? I'm getting a strong not-invented-here vibe and I'm not > sure you've internalized all the interactions that make vfio manage > groups the way it does. > > Alex > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2011-12-09 3:02 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-12-07 3:58 RFC: Device isolation infrastructure David Gibson 2011-12-07 6:22 ` Benjamin Herrenschmidt 2011-12-07 6:46 ` Alex Williamson 2011-12-07 14:22 ` David Gibson 2011-12-07 19:45 ` Alex Williamson 2011-12-07 23:41 ` Benjamin Herrenschmidt 2011-12-08 0:16 ` Chris Wright 2011-12-08 1:32 ` Benjamin Herrenschmidt 2011-12-08 1:36 ` Chris Wright 2011-12-08 5:51 ` Alex Williamson 2011-12-09 2:38 ` David Gibson 2011-12-08 2:43 ` David Gibson 2011-12-08 6:23 ` Alex Williamson 2011-12-08 6:52 ` David Gibson 2011-12-08 14:28 ` Alex Williamson 2011-12-09 3:00 ` David Gibson
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.