Message-ID: <4EFB4EFD.4080706@redhat.com>
Date: Wed, 28 Dec 2011 19:16:45 +0200
From: Ronen Hod
References: <20111221213019.27028.26890.stgit@bling.home> <20111221214212.27028.78947.stgit@bling.home>
In-Reply-To: <20111221214212.27028.78947.stgit@bling.home>
Subject: Re: [Qemu-devel] [PATCH 1/5] vfio: Introduce documentation for VFIO driver
To: Alex Williamson
Cc: aafabbri@cisco.com, kvm@vger.kernel.org, B07421@freescale.com, aik@ozlabs.ru, joerg.roedel@amd.com, agraf@suse.de, qemu-devel@nongnu.org, chrisw@sous-sol.org, B08248@freescale.com, iommu@lists.linux-foundation.org, avi@redhat.com, linux-pci@vger.kernel.org, david@gibson.dropbear.id.au, linux-kernel@vger.kernel.org, benve@cisco.com

On 12/21/2011 11:42 PM, Alex Williamson wrote:
> Including rationale for design, example usage and API description.
>
> Signed-off-by: Alex Williamson
> ---
>
>  Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 352 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/vfio.txt
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..09a5a5b
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,352 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern systems now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
> +POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
> +systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
> +agnostic framework for exposing direct device access to userspace, in
> +a secure, IOMMU protected environment. In other words, this allows
> +safe[2], non-privileged, userspace drivers.
> +
> +Why do we want that? Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance. From a device and host perspective, this simply
> +turns the VM into a userspace driver, with the benefits of
> +significantly reduced latency, higher bandwidth, and direct use of
> +bare-metal device drivers[3].
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace. Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators.
> +Prior to VFIO, these drivers had to either
> +go through the full development cycle to become a proper upstream
> +driver, be maintained out of tree, or make use of the UIO framework,
> +which has no notion of IOMMU protection, limited interrupt support,
> +and requires root privileges to access things like PCI configuration
> +space.
> +
> +The VFIO driver framework intends to unify these, replacing the
> +KVM PCI specific device assignment code and providing a more
> +secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, and IOMMUs
> +-------------------------------------------------------------------------------
> +
> +Userspace drivers are primarily concerned with manipulating individual
> +devices and setting up mappings in the IOMMU for those devices.
> +Unfortunately, the IOMMU doesn't always have the granularity to track
> +mappings for an individual device. Sometimes this is a topology
> +barrier, such as a PCIe-to-PCI bridge interposing the device and
> +IOMMU, other times this is an IOMMU limitation. In any case, the
> +reality is that devices are not always independent with respect to the
> +IOMMU. Translations set up for one device can be used by another
> +device in these scenarios.
> +
> +The IOMMU API exposes these relationships by identifying an "IOMMU
> +group" for these dependent devices. Devices on the same bus with the
> +same IOMMU group (or just "group" for this document) are not isolated
> +from each other with respect to DMA mappings. For userspace usage,
> +this logically means that instead of being able to grant ownership of
> +an individual device, we must grant ownership of a group, which may
> +contain one or more devices.
> +
> +These groups therefore become a fundamental component of VFIO and the
> +working unit we use for exposing devices and granting permissions to
> +userspace. In addition, VFIO makes efforts to ensure the integrity of
> +the group for user access.
> +This includes ensuring that all devices
> +within the group are controlled by VFIO (vs native host drivers)
> +before allowing a user to access any member of the group or the IOMMU
> +mappings, as well as maintaining the group viability as devices are
> +dynamically added to or removed from the system.
> +
> +To access a device through VFIO, a user must open a character device
> +for the group that the device belongs to and then issue an ioctl to
> +retrieve a file descriptor for the individual device. This ensures
> +that the user has permissions to the group (file based access to the
> +/dev entry) and allows a check point at which VFIO can deny access to
> +the device if the group is not viable (all devices within the group
> +controlled by VFIO). A file descriptor for the IOMMU is obtained in the
> +same fashion.
> +
> +VFIO defines a standard set of APIs for access to devices and a
> +modular interface for adding new, bus-specific VFIO device drivers.
> +We call these "VFIO bus drivers". The vfio-pci module is an example
> +of a bus driver for exposing PCI devices. When the bus driver module
> +is loaded it enumerates all of the devices for its bus, registering
> +each device with the vfio core along with a set of callbacks. For
> +buses that support hotplug, the bus driver also adds itself to the
> +notification chain for such events. The callbacks registered with
> +each device implement the VFIO device access API for that bus.
> +
> +The VFIO device API includes ioctls for describing the device, the I/O
> +regions and their read/write/mmap offsets on the device descriptor, as
> +well as mechanisms for describing and registering interrupt
> +notifications.
> +
> +The VFIO IOMMU object is accessed in a similar way; an ioctl on the
> +group provides a file descriptor for programming the IOMMU. Like
> +devices, the IOMMU file descriptor is only accessible when a group is
> +viable.
> +The API for the IOMMU is effectively a userspace extension of
> +the kernel IOMMU API. The IOMMU provides an ioctl to describe the
> +IOMMU domain as well as to set up and tear down DMA mappings. As the
> +IOMMU API is extended to support more esoteric IOMMU implementations,
> +it's expected that the VFIO interface will also evolve.
> +
> +To facilitate this evolution, all of the VFIO interfaces are designed
> +for extensions. Particularly, for all structures passed via ioctl, we
> +include a structure size and flags field. We also define the ioctl
> +request to be independent of passed structure size. This allows us to
> +later add structure fields and define flags as necessary. It's
> +expected that each additional field will have an associated flag to
> +indicate whether the data is valid. Additionally, we provide an
> +"info" ioctl for each file descriptor, which allows us to flag new
> +features as they're added (e.g. an IOMMU domain configuration ioctl).
> +
> +The final aspect of VFIO is the notion of merging groups. In both the
> +assignment of devices to virtual machines and the pure userspace
> +driver model, it's expected that a single user instance is likely to
> +have multiple groups in use simultaneously. If these groups are all
> +using the same set of IOMMU mappings, the overhead of userspace
> +setting up and tearing down the mappings, as well as the internal
> +IOMMU driver overhead of managing those mappings can be non-trivial.
> +Some IOMMU implementations are able to easily reduce this overhead by
> +simply using the same set of page tables across multiple groups.
> +VFIO allows users to take advantage of this option by merging groups
> +together, effectively creating a super group (IOMMU groups only define
> +the minimum granularity).
> +
> +A user can attempt to merge groups together by calling the merge ioctl
> +on one group (the "merger") and passing a file descriptor for the
> +group to be merged in (the "mergee").
> +Note that existing DMA mappings
> +cannot be atomically merged between groups; it's therefore a
> +requirement that the mergee group is not in use. This is enforced by
> +not allowing open device or iommu file descriptors on the mergee group
> +at the time of merging. The merger group can be actively in use at
> +the time of merging. Likewise, to unmerge a group, none of the device
> +file descriptors for the group being removed can be in use. The
> +remaining merged group can be actively in use.
> +

Can you elaborate on the scenario that would lead a user to merge groups?
Does it make sense to try to "automatically" merge a (new) group with
all the existing groups sometime prior to its first device open?

As always, it is a pleasure to read your documentation.
Ronen.

> +If the groups cannot be merged, the ioctl will fail and the user will
> +need to manage the groups independently. Users should have no
> +expectation for group merging to be successful. Some platforms may
> +not support it at all, others may only enable merging of sufficiently
> +similar groups. If the ioctl succeeds, then the group file
> +descriptors are effectively fungible between the groups. That is,
> +instead of their actions being isolated to the individual group, each
> +of them is a gateway into the combined, merged group. For instance,
> +retrieving an IOMMU file descriptor from any group returns a reference
> +to the same object, mappings to that IOMMU descriptor are visible to
> +all devices in the merged group, and device descriptors can be
> +retrieved for any device in the merged group from any one of the group
> +file descriptors. In effect, a user can manage devices and the IOMMU
> +of a merged group using a single file descriptor (saving the merged
> +groups' file descriptors away only for unmerging) without the
> +permission complications of creating a separate "super group" character
> +device.
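
The size-and-flags convention that the quoted text describes for ioctl structures can be sketched in plain C. This is only a userspace illustration under stated assumptions, not the VFIO implementation: the struct demo_info, the flag DEMO_FLAG_HAS_EXTRA and the helper demo_fill_info() are hypothetical names that do not appear in the patch, and a real kernel handler would use copy_to_user() where this sketch uses memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag: set when the 'extra' field carries valid data. */
#define DEMO_FLAG_HAS_EXTRA (1u << 0)

/* Hypothetical structure: revision 1 ended at 'flags', revision 2 added 'extra'. */
struct demo_info {
    uint32_t argsz;   /* size of the structure as the caller knows it */
    uint32_t flags;   /* one flag per optional field */
    uint64_t extra;   /* field added in a later revision */
};

/* Smallest layout any caller may pass: argsz and flags only. */
#define DEMO_INFO_MINSZ offsetof(struct demo_info, extra)

/*
 * Stand-in for the handler side of an "info" ioctl. It copies out only
 * as many bytes as the caller declared in argsz, so callers built
 * against either revision can share one request number.
 * Returns 0 on success, -1 if argsz cannot hold even the base fields.
 */
static int demo_fill_info(void *user_buf, uint32_t argsz)
{
    struct demo_info info;
    size_t copy;

    assert(user_buf != NULL);
    if (argsz < DEMO_INFO_MINSZ)
        return -1;

    memset(&info, 0, sizeof(info));
    info.argsz = sizeof(info);          /* report the size this side supports */
    if (argsz >= sizeof(info)) {
        info.extra = 42;                /* arbitrary demo value */
        info.flags |= DEMO_FLAG_HAS_EXTRA;
    }

    copy = argsz < sizeof(info) ? argsz : sizeof(info);
    memcpy(user_buf, &info, copy);
    return 0;
}
```

A caller that only knows the first revision passes the smaller size and never sees the flag, while a newer caller checks DEMO_FLAG_HAS_EXTRA before trusting the added field.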
> +
> +VFIO Usage Example
> +-------------------------------------------------------------------------------
> +
> +Assume a user wants to access PCI device 0000:06:0d.0
> +
> +$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
> +240
> +
> +Since this device is on the "pci" bus, the user can then find the
> +character device for interacting with the VFIO group as:
> +
> +$ ls -l /dev/vfio/pci:240
> +crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
> +
> +We can also examine other members of the group through sysfs:
> +
> +$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
> +total 0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 -> \
> +        ../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 -> \
> +        ../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
> +
> +This group therefore contains two devices[4]. VFIO will prevent
> +device or iommu manipulation unless all group members are attached to
> +the vfio bus driver, so we simply unbind the devices from their
> +current driver and rebind them to vfio:
> +
> +# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
> +        dir=$(readlink -f $i)
> +        if [ -L $dir/driver ]; then
> +                echo $(basename $i) > $dir/driver/unbind
> +        fi
> +        vendor=$(cat $dir/vendor)
> +        device=$(cat $dir/device)
> +        echo $vendor $device > /sys/bus/pci/drivers/vfio/new_id
> +        echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind
> +done
> +
> +# chown user:user /dev/vfio/pci:240
> +
> +The user now has full access to all the devices and the iommu for this
> +group and can access them as follows:
> +
> +        int group, iommu, device, i;
> +        struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
> +        struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
> +        struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
> +        struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> +
> +        /* Open the group */
> +        group = open("/dev/vfio/pci:240", O_RDWR);
> +
> +        /* Test the group is viable and available */
> +        ioctl(group, VFIO_GROUP_GET_INFO, &group_info);
> +
> +        if (!(group_info.flags & VFIO_GROUP_FLAGS_VIABLE))
> +                /* Group is not viable */
> +
> +        if ((group_info.flags & VFIO_GROUP_FLAGS_MM_LOCKED))
> +                /* Already in use by someone else */
> +
> +        /* Get a file descriptor for the IOMMU */
> +        iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
> +
> +        /* Test the IOMMU is what we expect */
> +        ioctl(iommu, VFIO_IOMMU_GET_INFO, &iommu_info);
> +
> +        /* Allocate some space and setup a DMA mapping */
> +        dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
> +                             MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +        dma_map.size = 1024 * 1024;
> +        dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
> +        dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +
> +        ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dma_map);
> +
> +        /* Get a file descriptor for the device */
> +        device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> +
> +        /* Test and setup the device */
> +        ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
> +
> +        for (i = 0; i < device_info.num_regions; i++) {
> +                struct vfio_region_info reg = { .argsz = sizeof(reg) };
> +
> +                reg.index = i;
> +
> +                ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
> +
> +                /* Setup mappings... read/write offsets, mmaps
> +                 * For PCI devices, config space is a region */
> +        }
> +
> +        for (i = 0; i < device_info.num_irqs; i++) {
> +                struct vfio_irq_info irq = { .argsz = sizeof(irq) };
> +
> +                irq.index = i;
> +
> +                ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
> +
> +                /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
> +        }
> +
> +        /* Gratuitous device reset and go... */
> +        ioctl(device, VFIO_DEVICE_RESET);
> +
> +VFIO User API
> +-------------------------------------------------------------------------------
> +
> +Please see include/linux/vfio.h for complete API documentation.
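
The first lookup step of the usage example (take the group number read from iommu_group, then open the matching node under /dev/vfio) can be sketched as a small helper. The name demo_group_path() is hypothetical and not part of the proposed API; only the "/dev/vfio/pci:240" naming is taken from the quoted example, and the sketch assumes other buses follow the same "<bus>:<group>" pattern.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the group character device path in the
 * "/dev/vfio/<bus>:<group>" form shown in the usage example.
 * Returns 0 on success, -1 on an empty bus name or if the result
 * does not fit in 'buf'.
 */
static int demo_group_path(char *buf, size_t len, const char *bus, int group)
{
    int n;

    assert(buf != NULL && bus != NULL);
    if (strlen(bus) == 0)
        return -1;

    n = snprintf(buf, len, "/dev/vfio/%s:%d", bus, group);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string is what the example passes to open(); checking the return value avoids opening a silently truncated path.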
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
> +
> +When initialized, the bus driver should enumerate the devices on its
> +bus and call vfio_group_add_dev() for each device. If the bus
> +supports hotplug, notifiers should be enabled to track devices being
> +added and removed. vfio_group_del_dev() removes a previously added
> +device from vfio.
> +
> +extern int vfio_group_add_dev(struct device *dev,
> +                              const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *dev);
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
> +
> +struct vfio_device_ops {
> +        bool    (*match)(struct device *dev, char *buf);
> +        int     (*claim)(struct device *dev);
> +        int     (*open)(void *device_data);
> +        void    (*release)(void *device_data);
> +        ssize_t (*read)(void *device_data, char __user *buf,
> +                        size_t count, loff_t *ppos);
> +        ssize_t (*write)(void *device_data, const char __user *buf,
> +                         size_t size, loff_t *ppos);
> +        long    (*ioctl)(void *device_data, unsigned int cmd,
> +                         unsigned long arg);
> +        int     (*mmap)(void *device_data, struct vm_area_struct *vma);
> +};
> +
> +For buses supporting hotplug, all functions are required to be
> +implemented. Non-hotplug buses do not need to implement claim().
> +
> +match() provides a device specific method for associating a struct
> +device to a user provided string. Many drivers may simply strcmp the
> +buffer to dev_name().
> +
> +claim() is used when a device is hot-added to a group that is already
> +in use. This is how VFIO requests that a bus driver manually takes
> +ownership of a device. The expected call path for this is triggered
> +from the bus add notifier.
> +The bus driver calls vfio_group_add_dev for
> +the newly added device, vfio-core determines this group is already in
> +use and calls claim on the bus driver. This triggers the bus driver
> +to call its own probe function, including calling vfio_bind_dev to
> +mark the device as controlled by vfio. The device is then available
> +for use by the group.
> +
> +The remaining vfio_device_ops are similar to a simplified struct
> +file_operations except a device_data pointer is provided rather than a
> +file pointer. The device_data is an opaque structure registered by
> +the bus driver when a device is bound to the vfio bus driver:
> +
> +extern int vfio_bind_dev(struct device *dev, void *device_data);
> +extern void *vfio_unbind_dev(struct device *dev);
> +
> +When the device is unbound from the driver, the bus driver will call
> +vfio_unbind_dev() which will return the device_data for any bus driver
> +specific cleanup and freeing of the structure. The vfio_unbind_dev
> +call may block if the group is currently in use.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in its
> +initial implementation by Tom Lyon while at Cisco. We've since
> +outgrown the acronym, but it's catchy.
> +
> +[2] "safe" also depends upon a device being "well behaved". It's
> +possible for multi-function devices to have backdoors between
> +functions and even for single function devices to have alternative
> +access to things like PCI config space through MMIO registers. To
> +guard against the former we can include additional precautions in the
> +IOMMU driver to group multi-function PCI devices together
> +(iommu=group_mf). The latter we can't prevent, but the IOMMU should
> +still provide isolation. For PCI, Virtual Functions are the best
> +indicator of "well behaved", as these are designed for virtualization
> +usage models.
> +
> +[3] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO. It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> +
> +[4] In this case the device is below a PCI bridge:
> +
> +-[0000:00]-+-1e.0-[06]--+-0d.0
> +                        \-0d.1
> +
> +00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
>
>