From: Jike Song <jike.song@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Ruan, Shuai" <shuai.ruan@intel.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	kvm@vger.kernel.org, "igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Zhiyuan Lv <zhiyuan.lv@intel.com>
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Mon, 18 Jan 2016 16:56:13 +0800	[thread overview]
Message-ID: <569CA8AD.6070200@intel.com> (raw)
In-Reply-To: <1453092476.32741.67.camel@redhat.com>

On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
> 
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO can make
>> QEMU share as much code as possible with pcidev (PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the
>> QEMU part, which will benefit upper layers such as libvirt.
>>
>>
>> To achieve this, there is quite a lot to do; I'll summarize
>> it below. I dove into VFIO for a while but may still have
>> misunderstood some things, so please correct me :)
>>
>>
>>
>> First, let me illustrate my understanding of the current VFIO
>> framework used to pass through a pcidev to a guest:
>>
>>
>>                  +----------------------------------+
>>                  |            vfio qemu             |
>>                  +-----+------------------------+---+
>>                        |DMA                  ^  |CFG
>> QEMU                   |map               IRQ|  |
>> -----------------------|---------------------|--|-----------
>> KERNEL    +------------|---------------------|--|----------+
>>           | VFIO       |                     |  |          |
>>           |            v                     |  v          |
>>           |  +-------------------+     +-----+-----------+ |
>> IOMMU     |  | vfio iommu driver |     | vfio bus driver | |
>> API  <-------+                   |     |                 | |
>> Layer     |  | e.g. type1        |     | e.g. vfio_pci   | |
>>           |  +-------------------+     +-----------------+ |
>>           +------------------------------------------------+
>>
>>
>> Here, when a particular pcidev is passed through to a KVM guest,
>> it is attached to the vfio_pci driver in the host, and guest memory
>> is mapped into the IOMMU via the type1 iommu driver.
>>
>>
>> Then, the draft infrastructure of the future VFIO-based vGPU:
>>
>>
>>
>>                  +-------------------------------------+
>>                  |              vfio qemu              |
>>                  +----+-------------------------+------+
>>                       |DMA                   ^  |CFG
>> QEMU                  |map                IRQ|  |
>> ----------------------|----------------------|--|-----------
>> KERNEL                |                      |  |
>>          +------------|----------------------|--|----------+
>>          |VFIO        |                      |  |          |
>>          |            v                      |  v          |
>>          | +--------------------+      +-----+-----------+ |
>> DMA      | | vfio iommu driver  |      | vfio bus driver | |
>> API <------+                    |      |                 | |
>> Layer    | |  e.g. vfio_type2   |      |  e.g. vfio_vgpu | |
>>          | +--------------------+      +-----------------+ |
>>          |         |  ^                      |  ^          |
>>          +---------|--|----------------------|--|----------+
>>                    |  |                      |  |
>>                    |  |                      v  |
>>          +---------|--|----------+   +---------------------+
>>          | +-------v-----------+ |   |                     |
>>          | |                   | |   |                     |
>>          | |      KVMGT        | |   |                     |
>>          | |                   | |   |   host gfx driver   |
>>          | +-------------------+ |   |                     |
>>          |                       |   |                     |
>>          |    KVM hypervisor     |   |                     |
>>          +-----------------------+   +---------------------+
>>
>>         NOTE    vfio_type2 and vfio_vgpu are only *logically* part
>>                 of VFIO; they may be implemented in the KVM hypervisor
>>                 or the host gfx driver.
>>
>>
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1;
>> let's call it vfio_type2 temporarily. The main difference from pcidev
>> assignment is that a vGPU doesn't have its own DMA requester ID, so it
>> has to share mappings with the host and other vGPUs.
>>
>>         - type1 iommu driver maps gpa to hpa for passing through;
>>           whereas type2 maps iova to hpa;
>>
>>         - hardware iommu is always needed by type1, whereas for
>>           type2, hardware iommu is optional;
>>
>>         - type1 will invoke the low-level IOMMU API (iommu_map et al) to
>>           set up IOMMU page tables directly, whereas type2 doesn't (it
>>           only needs to invoke a higher-level DMA API like dma_map_page);
> 
> Yes, the current type1 implementation is not compatible with vgpu since
> there are not separate requester IDs on the bus and you probably don't
> want or need to pin all of guest memory like we do for direct
> assignment.  However, let's separate the type1 user API from the
> current implementation.  It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations.  A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.

Would you elaborate a bit on 'iommu backends' here? Previously
I thought the entire type1 driver would be duplicated. If not, what is
supposed to be added - a new vfio_dma_do_map?
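
If it helps, here is my rough reading of it as a sketch (the
vgpu_iommu_* names are made up, and I'm assuming the in-kernel
vfio_register_iommu_driver()/vfio_iommu_driver_ops interface stays
as it is today):

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/module.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/vfio.h>

/* per-container tracking; real code would keep an iova->vaddr list here */
struct vgpu_iommu {
	struct rb_root dma_list;
};

static void *vgpu_iommu_open(unsigned long arg)
{
	struct vgpu_iommu *iommu;

	if (arg != VFIO_TYPE1_IOMMU)
		return ERR_PTR(-EINVAL);

	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
	if (!iommu)
		return ERR_PTR(-ENOMEM);
	iommu->dma_list = RB_ROOT;
	return iommu;
}

static void vgpu_iommu_release(void *iommu_data)
{
	kfree(iommu_data);
}

static long vgpu_iommu_ioctl(void *iommu_data, unsigned int cmd,
			     unsigned long arg)
{
	if (cmd == VFIO_CHECK_EXTENSION)
		return arg == VFIO_TYPE1_IOMMU;
	/*
	 * VFIO_IOMMU_MAP_DMA / VFIO_IOMMU_UNMAP_DMA would only record or
	 * drop an iova->vaddr range here; pinning and translation happen
	 * later, when the host gfx driver actually needs the pages.
	 */
	return -ENOTTY;
}

static int vgpu_iommu_attach_group(void *iommu_data,
				   struct iommu_group *group)
{
	/* accept only vgpu groups; -ENODEV lets the core keep looking */
	return 0;
}

static void vgpu_iommu_detach_group(void *iommu_data,
				    struct iommu_group *group)
{
}

static const struct vfio_iommu_driver_ops vgpu_iommu_ops = {
	.name		= "vfio-iommu-vgpu",
	.owner		= THIS_MODULE,
	.open		= vgpu_iommu_open,
	.release	= vgpu_iommu_release,
	.ioctl		= vgpu_iommu_ioctl,
	.attach_group	= vgpu_iommu_attach_group,
	.detach_group	= vgpu_iommu_detach_group,
};

static int __init vgpu_iommu_init(void)
{
	return vfio_register_iommu_driver(&vgpu_iommu_ops);
}
module_init(vgpu_iommu_init);
MODULE_LICENSE("GPL v2");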

> The benefit here is that QEMU could work
> unmodified, using the type1 vfio-iommu API regardless of whether a
> device is directly assigned or virtual.
> 
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr) to
> the device address space (iova).  The host physical address is obtained
> by pinning the vaddr.  In the current implementation, a map operation
> pins pages and populates the hardware iommu.  A vgpu compatible
> implementation might simply register the translation into a kernel-
> based database to be called upon later.  When the host graphics driver
> needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
> translation, it already possesses the iova to vaddr mapping, which
> becomes iova to hpa after a pinning operation.
> 
> So, I would encourage you to look at creating a vgpu vfio iommu
> backend that makes use of the type1 API since it will reduce the
> changes necessary for userspace.
> 

Yes, keeping the type1 API sounds like a great idea.
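
Just to spell out what we would be keeping, a minimal userspace sketch
of the map call against today's uapi (a vgpu-aware backend would accept
exactly the same ioctl):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* map a chunk of guest RAM: process vaddr -> guest-physical iova */
static int map_guest_ram(int container_fd, void *vaddr,
			 uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)vaddr,
		.iova  = iova,
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}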

>> We also need to implement a new 'bus' driver instead of vfio_pci,
>> let's call it vfio_vgpu temporarily:
>>
>>         - vfio_pci is a real pci driver; it has a probe method called
>>           during device attach, whereas vfio_vgpu is a pseudo
>>           driver that won't attach any device - the GPU is always owned
>>           by the host gfx driver. It has to do its 'probing' elsewhere,
>>           but still in the host gfx driver attached to the device;
>>
>>         - a pcidev (PF or VF) attached to vfio_pci has a natural path
>>           in sysfs; whereas a vgpu is purely a software concept:
>>           vfio_vgpu needs to create/destroy vgpu instances,
>>           maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"),
>>           etc. There should be something added in a higher layer
>>           to do this (VFIO or DRM).
>>
>>         - vfio_pci in most cases will allow QEMU to access the pcidev
>>           hardware directly; whereas vfio_vgpu provides access to virtual
>>           resources emulated by another device model;
>>
>>         - vfio_pci will inject an IRQ into the guest only when a physical
>>           IRQ is generated; whereas vfio_vgpu may inject an IRQ purely for
>>           emulation purposes. Either way, they can share the same
>>           injection interface;
> 
> Here too, I think you're making assumptions based on an implementation
> path.  Personally, I think each vgpu should be a struct device and that
> an iommu group should be created for each.  I think this is a valid
> abstraction; dma isolation is provided through something other than a
> system-level iommu, but it's still provided.  Without this, the entire
> vfio core would need to be aware of vgpu, since the core operates on
> devices and groups.  I believe creating a struct device also gives you
> basic probe and release support for a driver.
> 

Indeed.
BTW, that should be done in the 'bus' driver, right?
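
Something roughly like the following in the 'bus' driver, I guess
(just a sketch; vgpu_register() is a made-up name, and I'm assuming
the iommu_group_alloc()/iommu_group_add_device() interfaces):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>

static int vgpu_register(struct device *vgpu_dev)
{
	struct iommu_group *group;
	int ret;

	ret = device_register(vgpu_dev);	/* gives us probe/release */
	if (ret)
		return ret;

	/* isolation comes from the device model, not a system iommu */
	group = iommu_group_alloc();
	if (IS_ERR(group)) {
		device_unregister(vgpu_dev);
		return PTR_ERR(group);
	}

	ret = iommu_group_add_device(group, vgpu_dev);
	iommu_group_put(group);		/* the group keeps its own reference */
	if (ret)
		device_unregister(vgpu_dev);

	return ret;
}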

> There will be a need for some sort of lifecycle management of a vgpu.
>  How is it created?  Destroyed?  Can it be given more or less resources
> than other vgpus, etc.  This could be implemented in sysfs for each
> physical gpu with vgpu support, sort of like how we support sr-iov now,
> the PF exports controls for creating VFs.  The more commonality we can
> get for lifecycle and device access for userspace, the better.
> 

Will have a look at the VF management code, thanks for the info.
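
For a vgpu I imagine that could be something like the sketch below,
on the physical GPU's device node (purely hypothetical - the attribute
name and the intel_vgpu_create() helper are made up, just to mirror
the sriov_numvfs idea):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

int intel_vgpu_create(struct device *gpu, unsigned int id);  /* hypothetical */

/* echo <id> > /sys/bus/pci/devices/<gpu>/vgpu_create */
static ssize_t vgpu_create_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	unsigned int id;
	int ret;

	ret = kstrtouint(buf, 0, &id);
	if (ret)
		return ret;

	ret = intel_vgpu_create(dev, id);	/* host gfx driver does the work */
	return ret ? ret : count;
}
static DEVICE_ATTR_WO(vgpu_create);

/* the host gfx driver would register it from probe():
 *	device_create_file(&pdev->dev, &dev_attr_vgpu_create);
 */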

> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components.  It's up
> to the bus driver how accesses to each space map to the physical
> device.  Take, for instance, PCI config space: the existing vfio-pci
> driver emulates some portions of config space for the user.
> 
>> Questions:
>>
>>         [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>>             upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>>             In my opinion, vfio_type2 doesn't rely on it to support the
>>             No-IOMMU case; instead it needs a new implementation which fits
>>             both with and without an IOMMU. Is this correct?
>>
> 
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba), this was
> simply a case that the kernel development outpaced the intended user
> and I didn't want to commit to the user api changes until it had been
> completely vetted.  In any case, vgpu should have no dependency
> whatsoever on no-iommu.  As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting because you are actually providing
> isolation through other means than a system iommu.
> 

Thanks for confirmation.

>> For things not mentioned above, we might have them discussed in
>> other threads, or temporarily maintained in a TODO list (we might get
>> back to them after the big picture get agreed):
>>
>>
>>         - How to expose guest framebuffer via VFIO for SPICE;
> 
> Potentially through a new, device specific region, which I think can be
> done within the existing vfio API.  The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we map
> to PCI resources.
> 
>>         - How to avoid double translation with the two stages (GTT + IOMMU):
>>           whether an identity map is possible, and if so, how to make it
>>           more efficient;
>>
>>         - Application acceleration
>>           You mentioned that with VFIO, a vGPU may be used by
>>           applications to get GPU acceleration. It's a potential
>>           opportunity to use vGPU in container use cases, worth
>>           further investigation.
> 
> Yes, interesting topics.  Thanks,
> 
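
On the framebuffer item above: from userspace I imagine discovering
such a device-specific region would look something like the sketch
below, against the current uapi (how the purpose of the extra index
gets advertised to the user is exactly the part still to be defined):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* walk the region indexes beyond the fixed PCI set (BARs/ROM/config/VGA);
 * num_regions comes from VFIO_DEVICE_GET_INFO (struct vfio_device_info) */
static void list_extra_regions(int device_fd, unsigned int num_regions)
{
	unsigned int i;

	for (i = VFIO_PCI_NUM_REGIONS; i < num_regions; i++) {
		struct vfio_region_info info = {
			.argsz = sizeof(info),
			.index = i,
		};

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
			continue;

		printf("region %u: size=0x%llx offset=0x%llx flags=0x%x\n",
		       i, (unsigned long long)info.size,
		       (unsigned long long)info.offset, info.flags);
	}
}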

Looks like things are getting clearer overall, with a few small
exceptions. Thanks for the advice :)


> Alex
> 

--
Thanks,
Jike
