From: Jike Song <jike.song@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Ruan, Shuai" <shuai.ruan@intel.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
kvm@vger.kernel.org, "igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
qemu-devel <qemu-devel@nongnu.org>,
Gerd Hoffmann <kraxel@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Zhiyuan Lv <zhiyuan.lv@intel.com>
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Mon, 18 Jan 2016 16:56:13 +0800 [thread overview]
Message-ID: <569CA8AD.6070200@intel.com> (raw)
In-Reply-To: <1453092476.32741.67.camel@redhat.com>
On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
>
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO lets QEMU
>> share as much code as possible with pcidev (PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the
>> QEMU part, which will benefit upper layers such as libvirt.
>>
>>
>> To achieve this, there is quite a lot to do; I'll summarize
>> it below. I have dived into VFIO for a while but may still have
>> misunderstood things, so please correct me :)
>>
>>
>>
>> First, let me illustrate my understanding of the current VFIO
>> framework used to pass through a pcidev to a guest:
>>
>>
>>            +----------------------------------+
>>            |            vfio qemu             |
>>            +-----+------------------------+---+
>>                  |DMA                   ^ |CFG
>> QEMU             |map                IRQ| |
>> -----------------|----------------------|-|-----------------
>> KERNEL  +--------|----------------------|-|--------------+
>>         | VFIO   |                      | |              |
>>         |        v                      | v              |
>>         |  +-------------------+  +-----+-----------+    |
>> IOMMU   |  | vfio iommu driver |  | vfio bus driver |    |
>>  API <-----+                   |  |                 |    |
>> Layer   |  |    e.g. type1     |  |  e.g. vfio_pci  |    |
>>         |  +-------------------+  +-----------------+    |
>>         +------------------------------------------------+
>>
>>
>> Here, when a particular pcidev is passed through to a KVM guest,
>> it is attached to the vfio_pci driver in the host, and guest memory
>> is mapped into the IOMMU via the type1 iommu driver.
>>
>>
>> Then, the draft infrastructure of the future VFIO-based vGPU:
>>
>>
>>
>>            +----------------------------------+
>>            |            vfio qemu             |
>>            +-----+------------------------+---+
>>                  |DMA                   ^ |CFG
>> QEMU             |map                IRQ| |
>> -----------------|----------------------|-|-----------------
>> KERNEL           |                      | |
>>         +--------|----------------------|-|--------------+
>>         | VFIO   |                      | |              |
>>         |        v                      | v              |
>>         |  +-------------------+  +-----+-----------+    |
>>  DMA    |  | vfio iommu driver |  | vfio bus driver |    |
>>  API <-----+                   |  |                 |    |
>> Layer   |  |  e.g. vfio_type2  |  |  e.g. vfio_vgpu |    |
>>         |  +-------------------+  +-----------------+    |
>>         |    |  ^                     |  ^               |
>>         +----|--|---------------------|--|---------------+
>>              |  |                     |  |
>>              |  |                     v  |
>>         +----|--|---------------+   +---------------------+
>>         | +--v----------------+ |   |                     |
>>         | |                   | |   |                     |
>>         | |       KVMGT       | |   |                     |
>>         | |                   | |   |   host gfx driver   |
>>         | +-------------------+ |   |                     |
>>         |                       |   |                     |
>>         |    KVM hypervisor     |   |                     |
>>         +-----------------------+   +---------------------+
>>
>> NOTE: vfio_type2 and vfio_vgpu are only *logically* parts
>> of VFIO; they may be implemented in the KVM hypervisor
>> or in the host gfx driver.
>>
>>
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1;
>> let's call it vfio_type2 temporarily. The main difference from pcidev
>> assignment is that a vGPU doesn't have its own DMA requester ID, so it
>> has to share mappings with the host and other vGPUs.
>>
>> - the type1 iommu driver maps gpa to hpa for pass-through,
>> whereas type2 maps iova to hpa;
>>
>> - a hardware iommu is always needed by type1, whereas for
>> type2 a hardware iommu is optional;
>>
>> - type1 invokes the low-level IOMMU API (iommu_map et al.) to
>> set up IOMMU page tables directly, whereas type2 doesn't (it only
>> needs to invoke a higher-level DMA API such as dma_map_page);
>
> Yes, the current type1 implementation is not compatible with vgpu since
> there are not separate requester IDs on the bus and you probably don't
> want or need to pin all of guest memory like we do for direct
> assignment. However, let's separate the type1 user API from the
> current implementation. It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations. A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.
Would you elaborate a bit on 'iommu backends' here? Previously
I thought the entire type1 driver would have to be duplicated. If not,
what is supposed to be added, a new vfio_dma_do_map?
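If I picture it right, it could look roughly like the sketch below: a second
backend registered via vfio_register_iommu_driver(), still answering
VFIO_CHECK_EXTENSION for VFIO_TYPE1_IOMMU, but only recording the
iova->vaddr translations instead of programming a hardware IOMMU.
This is very rough, the vgpu_iommu_* names are made up and the real
map/unmap bookkeeping is omitted:

/*
 * Sketch only: a second backend implementing the same
 * VFIO_TYPE1_IOMMU user API as the existing type1 driver.
 */
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/vfio.h>

struct vgpu_iommu {
    struct rb_root dma_list;        /* iova -> vaddr records */
    struct mutex lock;
};

static void *vgpu_iommu_open(unsigned long arg)
{
    struct vgpu_iommu *iommu;

    if (arg != VFIO_TYPE1_IOMMU)
        return ERR_PTR(-EINVAL);

    iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
    if (!iommu)
        return ERR_PTR(-ENOMEM);

    iommu->dma_list = RB_ROOT;
    mutex_init(&iommu->lock);
    return iommu;
}

static void vgpu_iommu_release(void *iommu_data)
{
    kfree(iommu_data);              /* plus unpinning anything pinned */
}

static long vgpu_iommu_ioctl(void *iommu_data,
                             unsigned int cmd, unsigned long arg)
{
    switch (cmd) {
    case VFIO_CHECK_EXTENSION:
        return arg == VFIO_TYPE1_IOMMU;   /* same user API as type1 */
    case VFIO_IOMMU_MAP_DMA:
        /* record iova -> vaddr only; pin/translate later on demand */
        return 0;
    case VFIO_IOMMU_UNMAP_DMA:
        /* drop the record, unpin if it was pinned */
        return 0;
    }
    return -ENOTTY;
}

static int vgpu_iommu_attach_group(void *iommu_data,
                                   struct iommu_group *group)
{
    /*
     * A real backend would accept only vgpu groups and return an
     * error otherwise, so that (with the change you describe) the
     * vfio core can fall back to the real type1 backend for
     * ordinary assigned devices.
     */
    return 0;
}

static void vgpu_iommu_detach_group(void *iommu_data,
                                    struct iommu_group *group)
{
}

static const struct vfio_iommu_driver_ops vgpu_iommu_ops = {
    .name           = "vfio-iommu-vgpu",
    .owner          = THIS_MODULE,
    .open           = vgpu_iommu_open,
    .release        = vgpu_iommu_release,
    .ioctl          = vgpu_iommu_ioctl,
    .attach_group   = vgpu_iommu_attach_group,
    .detach_group   = vgpu_iommu_detach_group,
};

static int __init vgpu_iommu_init(void)
{
    return vfio_register_iommu_driver(&vgpu_iommu_ops);
}
module_init(vgpu_iommu_init);
MODULE_LICENSE("GPL");

If that is roughly what you mean, then indeed only the backend internals
differ and the user-visible API stays type1.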
> The benefit here is that QEMU could work
> unmodified, using the type1 vfio-iommu API regardless of whether a
> device is directly assigned or virtual.
>
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr) to
> the device address space (iova). The host physical address is obtained
> by pinning the vaddr. In the current implementation, a map operation
> pins pages and populates the hardware iommu. A vgpu compatible
> implementation might simply register the translation into a kernel-
> based database to be called upon later. When the host graphics driver
> needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
> translation, it already possesses the iova to vaddr mapping, which
> becomes iova to hpa after a pinning operation.
>
> So, I would encourage you to look at creating a vgpu vfio iommu
> backend that makes use of the type1 api since it will reduce the
> changes necessary for userspace.
>
Yes, keeping the type1 API sounds like a great idea.
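To restate the benefit concretely: the userspace side keeps issuing exactly
the ioctl it issues today for an assigned device. A minimal sketch, assuming
a container fd that is already opened and has a group attached:

/* Userspace view: identical for a physical assigned device or a vGPU. */
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

static int map_guest_ram(int container_fd, void *vaddr,
                         uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uint64_t)(uintptr_t)vaddr; /* QEMU process address */
    map.iova  = iova;                       /* guest physical / device iova */
    map.size  = size;

    /*
     * With type1 this pins the pages and programs the hardware iommu;
     * with a vgpu backend it could merely record the translation for
     * the host gfx driver to use later.  QEMU cannot tell the difference.
     */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}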
>> We also need to implement a new 'bus' driver instead of vfio_pci,
>> let's call it vfio_vgpu temporarily:
>>
>> - vfio_pci is a real pci driver; it has a probe method called
>> during device attach. vfio_vgpu, by contrast, is a pseudo
>> driver: it won't attach to any device - the GPU is always owned by
>> the host gfx driver. The 'probing' has to be done elsewhere, but
>> still in the host gfx driver attached to the device;
>>
>> - a pcidev (PF or VF) attached to vfio_pci has a natural path
>> in sysfs, whereas a vgpu is purely a software concept:
>> vfio_vgpu needs to create/destroy vgpu instances and
>> maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"),
>> etc. Something should be added in a higher layer
>> to do this (VFIO or DRM);
>>
>> - vfio_pci in most cases allows QEMU to access the pcidev
>> hardware directly, whereas vfio_vgpu provides access to virtual
>> resources emulated by another device model;
>>
>> - vfio_pci injects an IRQ into the guest only when a physical IRQ
>> is generated, whereas vfio_vgpu may inject an IRQ for emulation
>> purposes. Either way, they can share the same injection interface;
>
> Here too, I think you're making assumptions based on an implementation
> path. Personally, I think each vgpu should be a struct device and that
> an iommu group should be created for each. I think this is a valid
> abstraction; dma isolation is provided through something other than a
> system-level iommu, but it's still provided. Without this, the entire
> vfio core would need to be aware of vgpu, since the core operates on
> devices and groups. I believe creating a struct device also gives you
> basic probe and release support for a driver.
>
Indeed.
BTW, that should be done in the 'bus' driver, right?
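For my own understanding, the 'bus' driver side of creating one vgpu
instance could then boil down to something like the sketch below. This is
very rough; struct vgpu and vfio_vgpu_dev_ops are invented names here and
error unwinding is abbreviated:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

struct vgpu {
    struct device dev;
    /* vendor-specific state ... */
};

/* .open/.release/.read/.write/.mmap/.ioctl would talk to the device model */
static const struct vfio_device_ops vfio_vgpu_dev_ops;

static int vgpu_register(struct vgpu *vgpu, struct device *gpu, int id)
{
    struct iommu_group *group;
    int ret;

    device_initialize(&vgpu->dev);
    vgpu->dev.parent = gpu;                 /* the physical GPU */
    dev_set_name(&vgpu->dev, "vgpu%d", id);
    ret = device_add(&vgpu->dev);
    if (ret)
        return ret;

    /* one iommu group per vgpu, even without a hardware iommu behind it */
    group = iommu_group_alloc();
    if (IS_ERR(group)) {
        ret = PTR_ERR(group);
        goto err_del;
    }
    ret = iommu_group_add_device(group, &vgpu->dev);
    iommu_group_put(group);
    if (ret)
        goto err_del;

    /* hand the device to the vfio core with our 'bus' driver ops */
    ret = vfio_add_group_dev(&vgpu->dev, &vfio_vgpu_dev_ops, vgpu);
    if (ret)
        goto err_del;

    return 0;

err_del:
    device_del(&vgpu->dev);
    return ret;
}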
> There will be a need for some sort of lifecycle management of a vgpu.
> How is it created? Destroyed? Can it be given more or less resources
> than other vgpus, etc. This could be implemented in sysfs for each
> physical gpu with vgpu support, sort of like how we support sr-iov now,
> the PF exports controls for creating VFs. The more commonality we can
> get for lifecycle and device access for userspace, the better.
>
Will have a look at how VF management works, thanks for the info.
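As a strawman for the lifecycle interface, the physical GPU could expose
something like a 'vgpu_create' attribute, similar in spirit to sriov_numvfs.
This is purely hypothetical, including the attribute name, the argument
format and the intel_vgpu_create() hook:

/*
 * Hypothetical sysfs control on the physical GPU, e.g.
 *   echo "0:small" > /sys/bus/pci/devices/0000:00:02.0/vgpu_create
 */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

extern int intel_vgpu_create(struct device *gpu, unsigned int id,
                             const char *type);    /* hypothetical hook */

static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
    unsigned int id;
    char type[16];

    if (sscanf(buf, "%u:%15s", &id, type) != 2)
        return -EINVAL;

    /* ask the host gfx driver to instantiate vgpu <id> of type <type> */
    if (intel_vgpu_create(dev, id, type))
        return -ENODEV;

    return count;
}
static DEVICE_ATTR_WO(vgpu_create);
/* registered on the GPU's device via device_create_file() or an attr group */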
> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components. It's up
> to the bus driver how accesses to each space map to the physical
> device. Take for instance PCI config space, the existing vfio-pci
> driver emulates some portions of config space for the user.
>
>> Questions:
>>
>> [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>> upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>> In my opinion, vfio_type2 doesn't rely on it to support the No-IOMMU
>> case; instead it needs a new implementation which fits both
>> w/ and w/o a hardware IOMMU. Is this correct?
>>
>
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba); this was
> simply a case where kernel development outpaced the intended user
> and I didn't want to commit to the user api changes until it had been
> completely vetted. In any case, vgpu should have no dependency
> whatsoever on no-iommu. As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting because you are actually providing
> isolation through other means than a system iommu.
>
Thanks for confirmation.
>> For things not mentioned above, we can have them discussed in
>> other threads, or temporarily maintained in a TODO list (we can get
>> back to them after the big picture is agreed):
>>
>>
>> - How to expose the guest framebuffer via VFIO for SPICE;
>
> Potentially through a new, device specific region, which I think can be
> done within the existing vfio API. The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we map
> to PCI resources.
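If we go that way, userspace would just probe the extra region index like
any other. A sketch of the query, with the actual index number and its
framebuffer semantics still to be defined:

/* Query one region index beyond the fixed PCI BAR/ROM/config set. */
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

static int query_region(int device_fd, uint32_t index,
                        uint64_t *size, uint64_t *offset)
{
    struct vfio_region_info info;

    memset(&info, 0, sizeof(info));
    info.argsz = sizeof(info);
    info.index = index;     /* e.g. a to-be-defined framebuffer region */

    if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
        return -1;

    *size = info.size;      /* 0 would mean "not implemented" */
    *offset = info.offset;  /* mmap()/pread() offset on device_fd */
    return 0;
}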
>
>> - How to avoid double translation with the two stages (GTT + IOMMU);
>> whether an identity map is possible, and if so, how to make it
>> more effective;
>>
>> - Application acceleration
>> You mentioned that with VFIO, a vGPU may be used by
>> applications to get GPU acceleration. That is a potential
>> opportunity to use vGPUs for containers, worthy of
>> further investigation.
>
> Yes, interesting topics. Thanks,
>
Looks like things are getting clearer overall, with small exceptions.
Thanks for the advice :)
> Alex
>
--
Thanks,
Jike