From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:56875)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <rhod@redhat.com>) id 1Rfx7s-0003kp-NS
	for qemu-devel@nongnu.org; Wed, 28 Dec 2011 12:17:18 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <rhod@redhat.com>) id 1Rfx7p-0003mK-Pq
	for qemu-devel@nongnu.org; Wed, 28 Dec 2011 12:17:16 -0500
Received: from mx1.redhat.com ([209.132.183.28]:9911)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <rhod@redhat.com>) id 1Rfx7p-0003mA-Bi
	for qemu-devel@nongnu.org; Wed, 28 Dec 2011 12:17:13 -0500
Message-ID: <4EFB4EFD.4080706@redhat.com>
Date: Wed, 28 Dec 2011 19:16:45 +0200
From: Ronen Hod <rhod@redhat.com>
MIME-Version: 1.0
References: <20111221213019.27028.26890.stgit@bling.home>
	<20111221214212.27028.78947.stgit@bling.home>
In-Reply-To: <20111221214212.27028.78947.stgit@bling.home>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 1/5] vfio: Introduce documentation for VFIO
 driver
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: aafabbri@cisco.com, kvm@vger.kernel.org, B07421@freescale.com, aik@ozlabs.ru, joerg.roedel@amd.com, agraf@suse.de, qemu-devel@nongnu.org, chrisw@sous-sol.org, B08248@freescale.com, iommu@lists.linux-foundation.org, avi@redhat.com, linux-pci@vger.kernel.org, david@gibson.dropbear.id.au, linux-kernel@vger.kernel.org, benve@cisco.com

On 12/21/2011 11:42 PM, Alex Williamson wrote:
> Including rationale for design, example usage and API description.
>
> Signed-off-by: Alex Williamson<alex.williamson@redhat.com>
> ---
>
>   Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 352 insertions(+), 0 deletions(-)
>   create mode 100644 Documentation/vfio.txt
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..09a5a5b
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,352 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern system now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
> +POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
> +systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
> +agnostic framework for exposing direct device access to userspace, in
> +a secure, IOMMU protected environment.  In other words, this allows
> +safe[2], non-privileged, userspace drivers.
> +
> +Why do we want that?  Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance.  From a device and host perspective, this simply
> +turns the VM into a userspace driver, with the benefits of
> +significantly reduced latency, higher bandwidth, and direct use of
> +bare-metal device drivers[3].
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace.  Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators.  Prior to VFIO, these drivers had to either
> +go through the full development cycle to become proper upstream
> +driver, be maintained out of tree, or make use of the UIO framework,
> +which has no notion of IOMMU protection, limited interrupt support,
> +and requires root privileges to access things like PCI configuration
> +space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI specific device assignment code as well as provide a more
> +secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, and IOMMUs
> +-------------------------------------------------------------------------------
> +
> +Userspace drivers are primarily concerned with manipulating individual
> +devices and setting up mappings in the IOMMU for those devices.
> +Unfortunately, the IOMMU doesn't always have the granularity to track
> +mappings for an individual device.  Sometimes this is a topology
> +barrier, such as a PCIe-to-PCI bridge interposing the device and
> +IOMMU, other times this is an IOMMU limitation.  In any case, the
> +reality is that devices are not always independent with respect to the
> +IOMMU.  Translations setup for one device can be used by another
> +device in these scenarios.
> +
> +The IOMMU API exposes these relationships by identifying an "IOMMU
> +group" for these dependent devices.  Devices on the same bus with the
> +same IOMMU group (or just "group" for this document) are not isolated
> +from each other with respect to DMA mappings.  For userspace usage,
> +this logically means that instead of being able to grant ownership of
> +an individual device, we must grant ownership of a group, which may
> +contain one or more devices.
> +
> +These groups therefore become a fundamental component of VFIO and the
> +working unit we use for exposing devices and granting permissions to
> +userspace.  In addition, VFIO make efforts to ensure the integrity of
> +the group for user access.  This includes ensuring that all devices
> +within the group are controlled by VFIO (vs native host drivers)
> +before allowing a user to access any member of the group or the IOMMU
> +mappings, as well as maintaining the group viability as devices are
> +dynamically added or removed from the system.
> +
> +To access a device through VFIO, a user must open a character device
> +for the group that the device belongs to and then issue an ioctl to
> +retrieve a file descriptor for the individual device.  This ensures
> +that the user has permissions to the group (file based access to the
> +/dev entry) and allows a check point at which VFIO can deny access to
> +the device if the group is not viable (all devices within the group
> +controlled by VFIO).  A file descriptor for the IOMMU is obtain in the
> +same fashion.
> +
> +VFIO defines a standard set of APIs for access to devices and a
> +modular interface for adding new, bus-specific VFIO device drivers.
> +We call these "VFIO bus drivers".  The vfio-pci module is an example
> +of a bus driver for exposing PCI devices.  When the bus driver module
> +is loaded it enumerates all of the devices for it's bus, registering
> +each device with the vfio core along with a set of callbacks.  For
> +buses that support hotplug, the bus driver also adds itself to the
> +notification chain for such events.  The callbacks registered with
> +each device implement the VFIO device access API for that bus.
> +
> +The VFIO device API includes ioctls for describing the device, the I/O
> +regions and their read/write/mmap offsets on the device descriptor, as
> +well as mechanisms for describing and registering interrupt
> +notifications.
> +
> +The VFIO IOMMU object is accessed in a similar way; an ioctl on the
> +group provides a file descriptor for programming the IOMMU.  Like
> +devices, the IOMMU file descriptor is only accessible when a group is
> +viable.  The API for the IOMMU is effectively a userspace extension of
> +the kernel IOMMU API.  The IOMMU provides an ioctl to describe the
> +IOMMU domain as well as to setup and teardown DMA mappings.  As the
> +IOMMU API is extended to support more esoteric IOMMU implementations,
> +it's expected that the VFIO interface will also evolve.
> +
> +To facilitate this evolution, all of the VFIO interfaces are designed
> +for extensions.  Particularly, for all structures passed via ioctl, we
> +include a structure size and flags field.  We also define the ioctl
> +request to be independent of passed structure size.  This allows us to
> +later add structure fields and define flags as necessary.  It's
> +expected that each additional field will have an associated flag to
> +indicate whether the data is valid.  Additionally, we provide an
> +"info" ioctl for each file descriptor, which allows us to flag new
> +features as they're added (ex. an IOMMU domain configuration ioctl).
> +
> +The final aspect of VFIO is the notion of merging groups.  In both the
> +assignment of devices to virtual machines and the pure userspace
> +driver model, it's expect that a single user instance is likely to
> +have multiple groups in use simultaneously.  If these groups are all
> +using the same set of IOMMU mappings, the overhead of userspace
> +setting up and tearing down the mappings, as well as the internal
> +IOMMU driver overhead of managing those mappings can be non-trivial.
> +Some IOMMU implementations are able to easily reduce this overhead by
> +simply using the same set of page tables across multiple groups.
> +VFIO allows users to take advantage of this option by merging groups
> +together, effectively creating a super group (IOMMU groups only define
> +the minimum granularity).
> +
> +A user can attempt to merge groups together by calling the merge ioctl
> +on one group (the "merger") and passing a file descriptor for the
> +group to be merged in (the "mergee").  Note that existing DMA mappings
> +cannot be atomically merged between groups, it's therefore a
> +requirement that the mergee group is not in use.  This is enforced by
> +not allowing open device or iommu file descriptors on the mergee group
> +at the time of merging.  The merger group can be actively in use at
> +the time of merging.  Likewise, to unmerge a group, none of the device
> +file descriptors for the group being removed can be in use.  The
> +remaining merged group can be actively in use.
> +

Can you elaborate on the scenario that led a user to merge groups?
Does it make sense to try to "automatically" merge a (new) group with 
all the existing groups sometime prior to its first device open?

As always, it is a pleasure to read your documentation.
Ronen.

> +If the groups cannot be merged, the ioctl will fail and the user will
> +need to manage the groups independently.  Users should have no
> +expectation for group merging to be successful.  Some platforms may
> +not support it at all, others may only enable merging of sufficiently
> +similar groups.  If the ioctl succeeds, then the group file
> +descriptors are effectively fungible between the groups.  That is,
> +instead of their actions being isolated to the individual group, each
> +of them are gateways into the combined, merged group.  For instance,
> +retrieving an IOMMU file descriptor from any group returns a reference
> +to the same object, mappings to that IOMMU descriptor are visible to
> +all devices in the merged group, and device descriptors can be
> +retrieved for any device in the merged group from any one of the group
> +file descriptors.  In effect, a user can manage devices and the IOMMU
> +of a merged group using a single file descriptor (saving the merged
> +groups file descriptors away only for unmerged) without the
> +permission complications of creating a separate "super group" character
> +device.
> +
> +VFIO Usage Example
> +-------------------------------------------------------------------------------
> +
> +Assume user wants to access PCI device 0000:06:0d.0
> +
> +$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
> +240
> +
> +Since this device is on the "pci" bus, the user can then find the
> +character device for interacting with the VFIO group as:
> +
> +$ ls -l /dev/vfio/pci:240
> +crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
> +
> +We can also examine other members of the group through sysfs:
> +
> +$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
> +total 0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 ->  \
> +		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 ->  \
> +		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
> +
> +This group therefore contains two devices[4].  VFIO will prevent
> +device or iommu manipulation unless all group members are attached to
> +the vfio bus driver, so we simply unbind the devices from their
> +current driver and rebind them to vfio:
> +
> +# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
> +	dir=$(readlink -f $i)
> +	if [ -L $dir/driver ]; then
> +		echo $(basename $i)>  $dir/driver/unbind
> +	fi
> +	vendor=$(cat $dir/vendor)
> +	device=$(cat $dir/device)
> +	echo $vendor $device>  /sys/bus/pci/drivers/vfio/new_id
> +	echo $(basename $i)>  /sys/bus/pci/drivers/vfio/bind
> +done
> +
> +# chown user:user /dev/vfio/pci:240
> +
> +The user now has full access to all the devices and the iommu for this
> +group and can access them as follows:
> +
> +	int group, iommu, device, i;
> +	struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
> +	struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
> +	struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
> +	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> +
> +	/* Open the group */
> +	group = open("/dev/vfio/pci:240", O_RDWR);
> +
> +	/* Test the group is viable and available */
> +	ioctl(group, VFIO_GROUP_GET_INFO,&group_info);
> +
> +	if (!(group_info.flags&  VFIO_GROUP_FLAGS_VIABLE))
> +		/* Group is not viable */
> +
> +	if ((group_info.flags&  VFIO_GROUP_FLAGS_MM_LOCKED))
> +		/* Already in use by someone else */
> +
> +	/* Get a file descriptor for the IOMMU */
> +	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
> +
> +	/* Test the IOMMU is what we expect */
> +	ioctl(iommu, VFIO_IOMMU_GET_INFO,&iommu_info);
> +
> +	/* Allocate some space and setup a DMA mapping */
> +	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
> +			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +	dma_map.size = 1024 * 1024;
> +	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
> +	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +
> +	ioctl(iommu, VFIO_IOMMU_MAP_DMA,&dma_map);
> +
> +	/* Get a file descriptor for the device */
> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> +
> +	/* Test and setup the device */
> +	ioctl(device, VFIO_DEVICE_GET_INFO,&device_info);
> +
> +	for (i = 0; i<  device_info.num_regions; i++) {
> +		struct vfio_region_info reg = { .argsz = sizeof(reg) };
> +
> +		reg.index = i;
> +
> +		ioctl(device, VFIO_DEVICE_GET_REGION_INFO,&reg);
> +
> +		/* Setup mappings... read/write offsets, mmaps
> +		 * For PCI devices, config space is a region */
> +	}
> +
> +	for (i = 0; i<  device_info.num_irqs; i++) {
> +		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
> +
> +		irq.index = i;
> +
> +		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO,&reg);
> +
> +		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
> +	}
> +
> +	/* Gratuitous device reset and go... */
> +	ioctl(device, VFIO_DEVICE_RESET);
> +
> +VFIO User API
> +-------------------------------------------------------------------------------
> +
> +Please see include/linux/vfio.h for complete API documentation.
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
> +
> +When initialized, the bus driver should enumerate the devices on its
> +bus and call vfio_group_add_dev() for each device.  If the bus
> +supports hotplug, notifiers should be enabled to track devices being
> +added and removed.  vfio_group_del_dev() removes a previously added
> +device from vfio.
> +
> +extern int vfio_group_add_dev(struct device *dev,
> +                              const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *dev);
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
> +
> +struct vfio_device_ops {
> +	bool	(*match)(struct device *dev, char *buf);
> +	int	(*claim)(struct device *dev);
> +	int	(*open)(void *device_data);
> +	void	(*release)(void *device_data);
> +	ssize_t	(*read)(void *device_data, char __user *buf,
> +			size_t count, loff_t *ppos);
> +	ssize_t	(*write)(void *device_data, const char __user *buf,
> +			 size_t size, loff_t *ppos);
> +	long	(*ioctl)(void *device_data, unsigned int cmd,
> +			 unsigned long arg);
> +	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
> +};
> +
> +For buses supporting hotplug, all functions are required to be
> +implemented.  Non-hotplug buses do not need to implement claim().
> +
> +match() provides a device specific method for associating a struct
> +device to a user provided string.  Many drivers may simply strcmp the
> +buffer to dev_name().
> +
> +claim() is used when a device is hot-added to a group that is already
> +in use.  This is how VFIO requests that a bus driver manually takes
> +ownership of a device.  The expected call path for this is triggered
> +from the bus add notifier.  The bus driver calls vfio_group_add_dev for
> +the newly added device, vfio-core determines this group is already in
> +use and calls claim on the bus driver.  This triggers the bus driver
> +to call it's own probe function, including calling vfio_bind_dev to
> +mark the device as controlled by vfio.  The device is then available
> +for use by the group.
> +
> +The remaining vfio_device_ops are similar to a simplified struct
> +file_operations except a device_data pointer is provided rather than a
> +file pointer.  The device_data is an opaque structure registered by
> +the bus driver when a device is bound to the vfio bus driver:
> +
> +extern int vfio_bind_dev(struct device *dev, void *device_data);
> +extern void *vfio_unbind_dev(struct device *dev);
> +
> +When the device is unbound from the driver, the bus driver will call
> +vfio_unbind_dev() which will return the device_data for any bus driver
> +specific cleanup and freeing of the structure.  The vfio_unbind_dev
> +call may block if the group is currently in use.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
> +initial implementation by Tom Lyon while as Cisco.  We've since
> +outgrown the acronym, but it's catchy.
> +
> +[2] "safe" also depends upon a device being "well behaved".  It's
> +possible for multi-function devices to have backdoors between
> +functions and even for single function devices to have alternative
> +access to things like PCI config space through MMIO registers.  To
> +guard against the former we can include additional precautions in the
> +IOMMU driver to group multi-function PCI devices together
> +(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
> +still provide isolation.  For PCI, Virtual Functions are the best
> +indicator of "well behaved", as these are designed for virtualization
> +usage models.
> +
> +[3] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO.  It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> +
> +[4] In this case the device is below a PCI bridge:
> +
> +-[0000:00]-+-1e.0-[06]--+-0d.0
> +                        \-0d.1
> +
> +00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
>
>