linux-kernel.vger.kernel.org archive mirror
From: Randy Dunlap <randy.dunlap@oracle.com>
To: "Tom Lyon" <pugs@cisco.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	chrisw@sous-sol.org, joro@8bytes.org, hjk@linutronix.de,
	mst@redhat.com, avi@redhat.com, gregkh@suse.de,
	aafabbri@cisco.com, scofeldm@cisco.com
Subject: Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
Date: Fri, 28 May 2010 16:56:00 -0700
Message-ID: <20100528165600.68cd2618.randy.dunlap@oracle.com>
In-Reply-To: <4c004cba.Z/2Hpd7reetFaFC5%pugs@cisco.com>

On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

> diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
> --- linux-2.6.34/Documentation/vfio.txt	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/Documentation/vfio.txt	2010-05-28 14:03:05.000000000 -0700
> @@ -0,0 +1,176 @@
> +-------------------------------------------------------------------------------
> +The VFIO "driver" is used to allow privileged AND non-privileged processes to
> +implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> +devices.
> +
> +Why is this interesting?  Some applications, especially in the high performance
> +computing field, need access to hardware functions with as little overhead as
> +possible. Examples are in network adapters (typically non tcp/ip based) and

                                                         non-TCP/IP-based)

> +in compute accelerators - i.e., array processors, FPGA processors, etc.
> +Previous to the VFIO drivers these apps would need either a kernel-level
> +driver (with corrsponding overheads), or else root permissions to directly

                corresponding

> +access the hardware. The VFIO driver allows generic access to the hardware
> +from non-privileged apps IF the hardware is "well-behaved" enough for this
> +to be safe.
> +
> +While there have long been ways to implement user-level drivers using specific
> +corresponding drivers in the kernel, it was not until the introduction of the
> +UIO driver framework, and the uio_pci_generic driver that one could have a
> +generic kernel component supporting many types of user level drivers. However,
> +even with the uio_pci_generic driver, processes implementing the user level
> +drivers had to be trusted - they could do dangerous manipulation of DMA
> +addresses and were required to be root to write PCI configuration space
> +registers.
> +
> +Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
> +new hardware capabilities which the VFIO solution exploits to allow non-root
> +user level drivers. The main role of the IOMMU is to ensure that DMA accesses
> +from devices go only to the appropriate memory locations, this allows VFIO to

                                                  locations;

> +ensure that user level drivers do not corrupt inappropriate memory.  PCI I/O
> +virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
> +to guest virtual machines. VFIO in essence implements pass-through of devices
> +to user processes, not virtual machines.  SR-IOV devices implement a
> +traditional PCI device (the physical function) and a dynamic number of special
> +PCI devices (virtual functions) whose feature set is somewhat restricted - in
> +order to allow the operating system or virtual machine monitor to ensure the
> +safe operation of the system.
> +
> +Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
> +there are many other non-IOV PCI devices which also meet the definition.
> +Elements of this definition are:
> +- The size of any memory BARs to be mmap'ed into the user process space must be
> +  a multiple of the system page size.
> +- If MSI-X interrupts are used, the device driver must not attempt to mmap or
> +  write the MSI-X vector area.
> +- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
> +  revision 2.3 to allow its interrupts to be masked in a generic way.
> +- The device must not use the PCI configuration space in any non-standard way,
> +  i.e., the user level driver will be permitted only to read and write standard
> +  fields of the PCI config space, and only if those fields cannot cause harm to
> +  the system. In addition, some fields are "virtualized", so that the user
> +  driver can read/write them like a kernel driver, but they do not affect the
> +  real device.
> +- For now, there is no support for user access to the PCIe and PCI-X extended
> +  capabilities configuration space.
> +
> +Even with these restrictions, there are bound to be devices which are unsafe
> +for user level use - it is still up to the system admin to decide whether to
> +grant access to the device.  When the vfio module is loaded, it will have
> +access to no devices until the desired PCI devices are "bound" to the driver.
> +First, make sure the devices are not bound to another kernel driver. You can
> +unload that driver if you wish to unbind all its devices, or else enter the
> +driver's sysfs directory, and unbind a specific device:
> +	cd /sys/bus/pci/drivers/<drivername>
> +	echo 0000:06:02.00 > unbind
> +(The 0000:06:02.00 is a fully qualified PCI device name - different for each
> +device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
> +write the PCI device type of the target device to the new_id file:
> +	echo 8086 10ca > new_id
> +(8086 10ca are the vendor and device type for the Intel 82576 virtual function
> +devices). A /dev/vfio<N> entry will be created for each device bound. The final
> +step is to grant users permission by changing the mode and/or owner of the /dev
> +entry - "chmod 666 /dev/vfio0".
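
The unbind/new_id steps above could also be done from a small helper
program. A minimal sketch, assuming the sysfs paths named in the text;
write_attr() and rebind_to_vfio() are hypothetical names, not part of the
patch:

```c
#include <stdio.h>

/* Write a short string to a sysfs-style attribute file.
 * Returns 0 on success, -1 on failure. */
int write_attr(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	/* fputs() may only buffer; fclose() flushes and reports write errors */
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Unbind device `bdf` from driver `drv`, then register `ids`
 * ("vendor device", e.g. "8086 10ca") with the vfio driver. */
int rebind_to_vfio(const char *drv, const char *bdf, const char *ids)
{
	char path[256];
	int n;

	n = snprintf(path, sizeof(path), "/sys/bus/pci/drivers/%s/unbind", drv);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* driver name too long for the buffer */
	if (write_attr(path, bdf) != 0)
		return -1;
	return write_attr("/sys/bus/pci/drivers/vfio/new_id", ids);
}
```

Both sysfs writes need root; only the later chmod/chown of the /dev entry
opens the device to other users.
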
> +
> +Reads & Writes:
> +
> +The user driver will typically use mmap to access the memory BAR(s) of a
> +device; the I/O BARs and the PCI config space may be accessed through normal
> +read and write system calls. Only 1 file descriptor is needed for all driver
> +functions -- the desired BAR for I/O, memory, or config space is indicated via
> +high-order bits of the file offset.  For instance, the following implements a
> +write to the PCI config space:
> +
> +	#include <linux/vfio.h>
> +	void pci_write_config_word(int pci_fd, u16 off, u16 wd)
> +	{
> +		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
> +
> +		if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
> +			perror("pwrite config_dword");
> +	}
> +
> +The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
> +in vfio.h to convert bar numbers to file offsets and vice-versa.

                        BAR
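
For readers who have not seen this kind of offset encoding before, here is
a minimal sketch of the idea. The 32-bit shift and the ex_* names are
assumptions made for illustration only; the real layout is whatever vfio.h
in the patch defines:

```c
#include <stdint.h>

/* Hypothetical layout: the region index (BAR number, config space, ...)
 * sits above bit 32 of the file offset. Not taken from the patch. */
#define EX_SPACE_SHIFT 32

/* Base file offset of a region. */
static inline uint64_t ex_space_to_offset(unsigned int space)
{
	return (uint64_t)space << EX_SPACE_SHIFT;
}

/* Region index encoded in a file offset. */
static inline unsigned int ex_offset_to_space(uint64_t off)
{
	return (unsigned int)(off >> EX_SPACE_SHIFT);
}

/* Byte offset inside the region, with the region index masked off. */
static inline uint64_t ex_offset_within_space(uint64_t off)
{
	return off & ((UINT64_C(1) << EX_SPACE_SHIFT) - 1);
}
```

With such a scheme, a pread() or pwrite() at ex_space_to_offset(n) + reg
reaches register reg of region n through the one file descriptor.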

> +
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must one or more event descriptors and pass them to the vfio driver

           must ___ ?  missing word?

> +via ioctls to arrange for the interrupt mapping:
> +1.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> +		This provides an eventfd for traditional IRQ interrupts.
> +		IRQs will be disable after each interrupt until the driver

		             disabled

> +		re-enables them via the PCI COMMAND register.
> +2.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
> +		This connects MSI interrupts to an eventfd.
> +3.
> + 	int arg[N+1];
> +	arg[0] = N;
> +	arg[1..N] = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> +		This connects N MSI-X interrupts with N eventfds.
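
The arg[1..N] line in case 3 is pseudocode. A minimal sketch of how the
array might be built, assuming the N+1 layout described here;
alloc_msix_eventfds() is a hypothetical helper and the ioctl itself is
left to the caller:

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Build the N+1 element array: arg[0] = N, arg[1..N] = one eventfd each.
 * Returns NULL on failure. The caller frees the array and closes the
 * descriptors when done. */
int *alloc_msix_eventfds(int n)
{
	int *arg;
	int i;

	if (n <= 0)
		return NULL;
	arg = malloc(((size_t)n + 1) * sizeof(*arg));
	if (!arg)
		return NULL;
	arg[0] = n;
	for (i = 1; i <= n; i++) {
		arg[i] = eventfd(0, 0);
		if (arg[i] < 0) {
			/* undo the descriptors created so far */
			while (--i >= 1)
				close(arg[i]);
			free(arg);
			return NULL;
		}
	}
	return arg;
}
```

The returned pointer is what would be passed as the third argument of the
VFIO_EVENTFDS_MSIX ioctl.
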
> +
> +Waiting and checking for interrupts is done by the user program by reads,
> +polls, or selects on the related event file descriptors.
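
A minimal sketch of the wait side, using only the standard eventfd and
poll interfaces. wait_for_event() and signal_event() are hypothetical
names; signal_event() stands in for the kernel signalling the descriptor
so the code can be tried without a device:

```c
#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Wait up to timeout_ms for the eventfd to become readable, then read its
 * counter. Returns 1 and stores the count on an event, 0 on timeout,
 * -1 on error. */
int wait_for_event(int efd, int timeout_ms, uint64_t *count)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc <= 0)
		return rc;	/* 0: timeout, -1: error */
	/* read() returns the 8-byte counter and resets it to zero */
	if (read(efd, count, sizeof(*count)) != (ssize_t)sizeof(*count))
		return -1;
	return 1;
}

/* Test stand-in for an interrupt: adds n to the eventfd counter. */
int signal_event(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}
```

Since the counter accumulates, a value greater than 1 means several
events arrived since the last read.
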
> +
> +DMA:
> +
> +The VFIO driver uses ioctls to allow the user level driver to get DMA
> +addresses which correspond to virtual addresses.  In systems with IOMMUs,
> +each PCI device will have its own address space for DMA operations, so when
> +the user level driver programs the device registers, only addresses known to
> +the IOMMU will be valid, any others will be rejected.  The IOMMU creates the
> +illusion (to the device) that multi-page buffers are physically contiguous,
> +so a single DMA operation can safely span multiple user pages.  Note that
> +the VFIO driver is still useful in systems without IOMMUs, but only for
> +trusted processes which can deal with DMAs which do not span pages (Huge
> +pages count as a single page also).
> +
> +If the user process desires many DMA buffers, it may be wise to do a mapping
> +of a single large buffer, and then allocate the smaller buffers from the
> +large one.
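
One simple way to do the sub-allocation suggested here is a bump
allocator over the single mapped region. A minimal sketch; struct dma_pool
and dma_pool_alloc() are hypothetical and assume the region has already
been mapped, so that vaddr and dmaaddr refer to the same memory:

```c
#include <stdint.h>

/* One large region that has already been mapped for DMA. */
struct dma_pool {
	uint64_t vaddr;   /* process virtual address of the region */
	uint64_t dmaaddr; /* DMA address the device uses for the region */
	uint64_t size;    /* total size in bytes */
	uint64_t used;    /* bytes handed out so far */
};

/* Carve len bytes out of the pool, starting at an offset that is a
 * multiple of align (a power of two, no larger than the page size since
 * the region itself is page-aligned).
 * Returns 0 and fills *va and *dma on success, -1 if the pool is full. */
int dma_pool_alloc(struct dma_pool *p, uint64_t len, uint64_t align,
		   uint64_t *va, uint64_t *dma)
{
	uint64_t start = (p->used + align - 1) & ~(align - 1);

	if (start > p->size || len > p->size - start)
		return -1;
	*va = p->vaddr + start;
	*dma = p->dmaaddr + start;
	p->used = start + len;
	return 0;
}
```

This never frees individual buffers, which is often acceptable for
descriptor rings and packet buffers set up once at start-up.
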
> +
> +The DMA buffers are locked into physical memory for the duration of their
> +existence - until VFIO_DMA_UNMAP is called, until the user pages are
> +unmapped from the user process, or until the vfio file descriptor is closed.
> +The user process must have permission to lock the pages given by the ulimit(-l)
> +command, which in turn relies on settings in the /etc/security/limits.conf
> +file.
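
A user driver could check the limit up front rather than find out when the
mapping fails. A minimal sketch using getrlimit(); memlock_allows() is a
hypothetical helper, and it does not account for pages the process has
already locked:

```c
#include <stddef.h>
#include <sys/resource.h>

/* Returns 1 if the soft RLIMIT_MEMLOCK limit permits locking `bytes`
 * bytes, 0 if it does not, -1 if the limit cannot be read. */
int memlock_allows(size_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return (rlim_t)bytes <= rl.rlim_cur ? 1 : 0;
}
```
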
> +
> +The vfio_dma_map structure is used as an argument to the ioctls which
> +do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be a
> +multiple of a page. Its rdwr field is zero for read-only (outbound), and
> +non-zero for read/write buffers.
> +
> +	struct vfio_dma_map {
> +		__u64	vaddr;	  /* process virtual addr */
> +		__u64	dmaaddr;  /* desired and/or returned dma address */
> +		__u64	size;	  /* size in bytes */
> +		int	rdwr;	  /* bool: 0 for r/o; 1 for r/w */
> +	};
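
A minimal sketch of filling in this structure so that the page-multiple
rule holds. The ex_ names are hypothetical, and the struct is a local copy
using uint64_t in place of __u64 so that the example builds without the
patched headers:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Local copy of the structure quoted above. */
struct ex_dma_map {
	uint64_t vaddr;   /* process virtual addr */
	uint64_t dmaaddr; /* desired and/or returned dma address */
	uint64_t size;    /* size in bytes */
	int	 rdwr;    /* bool: 0 for r/o; 1 for r/w */
};

/* Round len up to a multiple of page (page must be a power of two). */
uint64_t round_up_page(uint64_t len, uint64_t page)
{
	return (len + page - 1) & ~(page - 1);
}

/* Allocate a page-aligned buffer of at least len bytes and describe it.
 * dmaaddr is left at 0 for the driver to fill in.
 * Returns 0 on success, -1 on failure. */
int ex_dma_map_init(struct ex_dma_map *m, size_t len, int rdwr)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf;

	if (page <= 0 || len == 0)
		return -1;
	memset(m, 0, sizeof(*m));
	m->size = round_up_page(len, (uint64_t)page);
	/* aligned_alloc() needs a size that is a multiple of the alignment */
	buf = aligned_alloc((size_t)page, (size_t)m->size);
	if (!buf)
		return -1;
	m->vaddr = (uint64_t)(uintptr_t)buf;
	m->rdwr = rdwr ? 1 : 0;
	return 0;
}
```

The buffer is released with free() after it has been unmapped.
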
> +
> +The VFIO_DMA_MAP_ANYWHERE is called with a vfio_dma_map structure as its
> +argument, and returns the structure with a valid dmaaddr field.
> +
> +The VFIO_DMA_MAP_IOVA is called with a vfio_dma_map structure with the
> +dmaaddr field already assigned. The system will attempt to map the DMA
> +buffer into the IO space at the givne dmaaddr. This is expected to be

                                   given

> +useful if KVM or other virtualization facilities use this driver.
> +
> +The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
> +the buffer and releases the corresponding system resources.
> +
> +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> +This call also has the side effect on enabled PCI bus mastership.

eh?  I don't get that last sentence...

> +
> +Miscellaneous:
> +
> +The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
> +device's base address region. It is passed a single integer specifying which
> +BAR (0-5 or 6 for ROM bar), and passes back the length in the same field.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

