From: Jike Song <jike.song@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "igvt-g@ml01.01.org" <igvt-g@ml01.01.org>,
	"Reddy, Raghuveer" <raghuveer.reddy@intel.com>,
	"White, Michael L" <michael.l.white@intel.com>,
	"Cowperthwaite, David J" <david.j.cowperthwaite@intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>,
	"Li, Susie" <susie.li@intel.com>,
	"Dong, Eddie" <eddie.dong@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Zhou, Chao" <chao.zhou@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Zhu, Libo" <libo.zhu@intel.com>,
	"Wang, Hongbo" <hongbo.wang@intel.com>
Subject: Re: [Announcement] 2015-Q3 release of XenGT - a Mediated Graphics Passthrough Solution from Intel
Date: Thu, 19 Nov 2015 15:22:56 +0800
Message-ID: <564D78D0.80904@intel.com>
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F7152DB@SHSMSX101.ccr.corp.intel.com>

Hi Alex,
On 11/19/2015 12:06 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Thursday, November 19, 2015 2:12 AM
>>
>> [cc +qemu-devel, +paolo, +gerd]
>>
>> On Tue, 2015-10-27 at 17:25 +0800, Jike Song wrote:
>>> {snip}
>>
>> Hi!
>>
>> At Red Hat we've been thinking about how to support vGPUs from multiple
>> vendors in a common way within QEMU.  We want to enable code sharing
>> between vendors and give new vendors an easy path to add their own
>> support.  We also have the complication that not all vGPU vendors are as
>> open source friendly as Intel, so being able to abstract the device
>> mediation and access outside of QEMU is a big advantage.
>>
>> The proposal I'd like to make is that a vGPU, whether it is from Intel
>> or another vendor, is predominantly a PCI(e) device.  We have an
>> interface in QEMU already for exposing arbitrary PCI devices, vfio-pci.
>> Currently vfio-pci uses the VFIO API to interact with "physical" devices
>> and system IOMMUs.  I highlight /physical/ there because some of these
>> physical devices are SR-IOV VFs, which is somewhat of a fuzzy concept,
>> somewhere between fixed hardware and a virtual device implemented in
>> software.  That software just happens to be running on the physical
>> endpoint.
>
> Agree.
>
> One clarification for the rest of the discussion: we're talking about the
> GVT-g vGPU here, which is a pure software GPU virtualization technique.
> GVT-d (note some use in the text) refers to passing through the whole GPU
> or a specific VF. GVT-d already fits the existing VFIO APIs nicely (though
> there is some on-going effort to remove Intel-specific platform stickiness
> from the gfx driver). :-)
>

Hi Alex, thanks for the discussion.

In addition to Kevin's replies, I have a high-level question: can VFIO
be used by QEMU for both KVM and Xen?

--
Thanks,
Jike

  
>>
>> vGPUs are similar, with the virtual device created at a different point,
>> host software.  They also rely on different IOMMU constructs, making use
>> of the MMU capabilities of the GPU (GTTs and such), but really having
>> similar requirements.
>
> There is one important difference between the system IOMMU and the GPU MMU
> here. The system IOMMU translates from a DMA target (IOVA on native, or
> GPA in the virtualization case) to HPA. The GPU's internal MMUs, however,
> translate from a Graphics Memory Address (GMA) to a DMA target (HPA if the
> system IOMMU is disabled, or IOVA/GPA if the system IOMMU is enabled). GMA
> is an internal address space within the GPU, not exposed to QEMU and fully
> managed by the GVT-g device model. Since it's not a standard PCI-defined
> resource, we don't need to abstract this capability in the VFIO interface.
>
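A minimal sketch of the two translation stages described above, written as a
toy Python model rather than GVT-g or VFIO code. The 4 KiB page size, the
dict-based tables, and the names translate and gpu_dma_address are
assumptions made only for illustration.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def translate(table, addr):
    """Map addr through a page-granular table, preserving the page offset."""
    page, offset = addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)
    return (table[page] << PAGE_SHIFT) | offset


def gpu_dma_address(gtt, iommu, gma):
    """Resolve a Graphics Memory Address to a host physical address.

    Stage 1 (GPU MMU):      GMA -> DMA target
    Stage 2 (system IOMMU): DMA target -> HPA
    iommu=None models a host with the system IOMMU disabled, in which
    case the GTT entries already hold HPAs.
    """
    dma_target = translate(gtt, gma)
    if iommu is None:
        return dma_target
    return translate(iommu, dma_target)
```

With gtt = {0x10: 0x200} and iommu = {0x200: 0x7f3}, the address 0x10abc
first becomes the DMA target 0x200abc and then the HPA 0x7f3abc. Only the
second table is something VFIO's IOMMU interface would ever see.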
>>
>> The proposal is therefore that GPU vendors can expose vGPUs to
>> userspace, and thus to QEMU, using the VFIO API.  For instance, vfio
>> supports modular bus drivers and IOMMU drivers.  An intel-vfio-gvt-d
>> module (or extension of i915) can register as a vfio bus driver, create
>> a struct device per vGPU, create an IOMMU group for that device, and
>> register that device with the vfio-core.  Since we don't rely on the
>> system IOMMU for GVT-d vGPU assignment, another vGPU vendor driver (or
>> extension of the same module) can register a "type1" compliant IOMMU
>> driver into vfio-core.  From the perspective of QEMU then, all of the
>> existing vfio-pci code is re-used, QEMU remains largely unaware of any
>> specifics of the vGPU being assigned, and the only necessary change so
>> far is how QEMU traverses sysfs to find the device and thus the IOMMU
>> group leading to the vfio group.
>
> GVT-g needs to pin guest memory and query GPA->HPA information, upon
> which the shadow GTTs are updated from (GMA->GPA) to (GMA->HPA). So yes,
> a dummy or simple "type1" compliant IOMMU can be introduced here just for
> this requirement.
>
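The shadow GTT update mentioned above can be sketched roughly as follows.
This is a toy model under stated assumptions, not the GVT-g implementation:
the class name ShadowGTT and the gpa_to_hpa callback (standing in for
"pin the guest page and return its host page") are hypothetical.

```python
class ShadowGTT:
    """Toy shadow page table: the guest sees GMA->GPA, hardware gets GMA->HPA."""

    def __init__(self, gpa_to_hpa):
        # Hypothetical callback: pins the guest page and returns the host page.
        self.gpa_to_hpa = gpa_to_hpa
        self.guest_entries = {}    # GMA page -> GPA page, as written by the guest
        self.shadow_entries = {}   # GMA page -> HPA page, as used by the GPU

    def guest_write(self, gma_pfn, gpa_pfn):
        """Trap a guest GTT write and mirror it into the shadow table."""
        self.guest_entries[gma_pfn] = gpa_pfn
        self.shadow_entries[gma_pfn] = self.gpa_to_hpa(gpa_pfn)

    def guest_read(self, gma_pfn):
        """The guest reads back what it wrote, never the host address."""
        return self.guest_entries[gma_pfn]
```

In this model the GPA->HPA lookup is the only service the device model needs
from the IOMMU backend, which is why a simple "type1" compliant driver is
plausibly sufficient for it.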
> However, there's one tricky point where I'm not sure whether the overall
> VFIO concept would be violated. GVT-g doesn't require the system IOMMU
> to function, but the host system may enable it just for hardening
> purposes. This means two levels of translation exist (GMA->IOVA->HPA),
> so the dummy IOMMU driver has to ask the system IOMMU driver to allocate
> IOVAs for VMs and then set up the IOVA->HPA mappings in the IOMMU page
> table. In this case, multiple VMs' translations are multiplexed in one
> IOMMU page table.
>
> We might need to create some group/sub-group or parent/child concepts
> among those IOMMUs for thorough permission control.
>
>>
>> There are a few areas where we know we'll need to extend the VFIO API to
>> make this work, but it seems like they can all be done generically.  One
>> is that PCI BARs are described through the VFIO API as regions and each
>> region has a single flag describing whether mmap (i.e. direct mapping) of
>> that region is possible.  We expect that vGPUs likely need finer
>> granularity, enabling some areas within a BAR to be trapped and forwarded
>> as a read or write access for the vGPU-vfio-device module to emulate,
>> while other regions, like framebuffers or texture regions, are directly
>> mapped.  I have prototype code to enable this already.
>
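One way to picture the finer granularity described above is a region that
carries a list of directly mappable sub-areas, with every other offset
trapped for emulation. The sketch below is a toy Python model, not the
prototype mentioned in the mail; the Region class, its fields, and the
"mmap"/"trap" labels are assumptions for illustration.

```python
from bisect import bisect_right


class Region:
    """Toy BAR region with mmap-able sub-areas; everything else is trapped."""

    def __init__(self, size, mmap_areas):
        # mmap_areas: non-overlapping (offset, length) pairs within the region.
        self.size = size
        self.areas = sorted(mmap_areas)
        self.starts = [offset for offset, _ in self.areas]

    def classify(self, offset):
        """Return "mmap" if offset lies in a mappable area, else "trap"."""
        if not 0 <= offset < self.size:
            raise ValueError("offset outside region")
        # Find the last area starting at or before offset.
        i = bisect_right(self.starts, offset) - 1
        if i >= 0:
            start, length = self.areas[i]
            if offset < start + length:
                return "mmap"
        return "trap"
```

For example, Region(0x1000000, [(0x800000, 0x800000)]) models a 16 MiB BAR
whose lower half holds trapped registers and whose upper half is mapped
directly.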
> Yes, in GVT-g one BAR resource might be partitioned among multiple vGPUs.
> If VFIO can support such partial resource assignment, it'd be great. A
> similar parent/child concept might also be required here, so that any
> resource enumerated on a vGPU doesn't break limitations enforced on the
> physical device.
>
> One unique requirement for GVT-g here, though, is that the vGPU device model
> needs to know the guest BAR configuration for proper emulation (e.g. to
> register an IO emulation handler with KVM). The same applies to the guest
> MSI vector for virtual interrupt injection. I'm not sure how this can fit
> into the common VFIO model. Does VFIO allow vendor-specific extensions today?
>
>>
>> Another area is that we really don't want to proliferate each vGPU
>> needing a new IOMMU type within vfio.  The existing type1 IOMMU provides
>> potentially the most simple mapping and unmapping interface possible.
>> We'd therefore need to allow multiple "type1" IOMMU drivers for vfio,
>> making type1 be more of an interface specification rather than a single
>> implementation.  This is a trivial change to make within vfio and one
>> that I believe is compatible with the existing API.  Note that
>> implementing a type1-compliant vfio IOMMU does not imply pinning and
>> mapping every registered page.  A vGPU, with mediated device access, may
>> use this only to track the current HVA to GPA mappings for a VM.  Only
>> when a DMA is enabled for the vGPU instance is that HVA pinned and an
>> HPA to GPA translation programmed into the GPU MMU.
>>
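The "track on map, pin only when DMA needs it" idea above could look roughly
like this. It is a toy Python model, not a vfio IOMMU driver: the class
TrackingType1Iommu, its method names, and the pin_page callback are
hypothetical, and the map call only loosely mirrors the iova/vaddr/size
triple that the type1 map interface takes.

```python
class TrackingType1Iommu:
    """Toy backend: record mappings on map, pin a page only on first DMA use."""

    def __init__(self, pin_page):
        # Hypothetical callback: pins an HVA page and returns its HPA page.
        self.pin_page = pin_page
        self.mappings = {}   # IOVA (guest physical) page -> HVA page
        self.pinned = {}     # IOVA page -> HPA page, filled lazily

    def map_dma(self, iova_pfn, hva_pfn, npages):
        """Only record the mapping; nothing is pinned here."""
        for i in range(npages):
            self.mappings[iova_pfn + i] = hva_pfn + i

    def unmap_dma(self, iova_pfn, npages):
        """Forget the mapping; a real driver would also unpin the pages."""
        for i in range(npages):
            self.mappings.pop(iova_pfn + i, None)
            self.pinned.pop(iova_pfn + i, None)

    def translate_for_dma(self, iova_pfn):
        """Pin on first use and return the host page for the GPU MMU."""
        if iova_pfn not in self.pinned:
            self.pinned[iova_pfn] = self.pin_page(self.mappings[iova_pfn])
        return self.pinned[iova_pfn]
```

The point of the sketch is the cost model: registering a large guest memory
range is cheap, and pinning happens only for pages the vGPU actually targets.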
>> Another area of extension is how to expose a framebuffer to QEMU for
>> seamless integration into a SPICE/VNC channel.  For this I believe we
>> could use a new region, much like we've done to expose VGA access
>> through a vfio device file descriptor.  An area within this new
>> framebuffer region could be directly mappable in QEMU while a
>> non-mappable page, at a standard location with standardized format,
>> provides a description of framebuffer and potentially even a
>> communication channel to synchronize framebuffer captures.  This would
>> be new code for QEMU, but something we could share among all vGPU
>> implementations.
>
> GVT-g already provides an interface to decode framebuffer information,
> with the assumption that the framebuffer will be further composited
> using OpenGL APIs, so the format is defined according to the OpenGL
> definition. Does that meet the SPICE requirement?
>
> One more thing to add: framebuffers are switched frequently in practice,
> so either QEMU needs to poll or a notification mechanism is required.
> And since this is dynamic, having the framebuffer page directly exposed in
> the new region might be tricky. We could just expose the framebuffer
> information (base, format, etc.) and let QEMU map it separately, outside
> the VFIO interface.
>
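A framebuffer description page that supports polling could be sketched as
below. The layout is entirely hypothetical (no such format is defined in this
thread): a generation counter followed by base, width, height, stride, and a
four-character format code, packed little-endian with Python's struct module.

```python
import struct

# Hypothetical descriptor layout, little-endian, no padding:
# generation (u32), base (u64), width (u32), height (u32), stride (u32), format (4 bytes)
FB_DESC = struct.Struct("<IQIII4s")


def pack_desc(generation, base, width, height, stride, fmt):
    """Producer side: serialize the current framebuffer description."""
    return FB_DESC.pack(generation, base, width, height, stride, fmt)


def poll_desc(page, last_generation):
    """Consumer side: return the new description, or None if nothing changed."""
    generation, base, width, height, stride, fmt = FB_DESC.unpack_from(page)
    if generation == last_generation:
        return None
    return {
        "generation": generation,
        "base": base,
        "width": width,
        "height": height,
        "stride": stride,
        "format": fmt,
    }
```

The producer bumps the generation on every framebuffer switch, so a poller
only re-maps when the counter moves. A notification mechanism would replace
the polling loop but could reuse the same descriptor.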
> And... this works fine with the vGPU model, since software knows all the
> details about the framebuffer. In the pass-through case, however, who do
> you expect to provide that information? Is it OK to introduce vGPU-specific
> APIs in VFIO?
>
>>
>> Another obvious area to be standardized would be how to discover,
>> create, and destroy vGPU instances.  SR-IOV has a standard mechanism to
>> create VFs in sysfs and I would propose that vGPU vendors try to
>> standardize on similar interfaces to enable libvirt to easily discover
>> the vGPU capabilities of a given GPU and manage the lifecycle of a vGPU
>> instance.
>
> There is no standard today. We expose vGPU life-cycle management APIs
> through sysfs (under the i915 node), which is very Intel-specific. In
> reality, different vendors have quite different capabilities for their own
> vGPUs, so I'm not sure how standard we can make such a mechanism. But this
> code should be minor to maintain in libvirt.
>
>>
>> This is obviously a lot to digest, but I'd certainly be interested in
>> hearing feedback on this proposal as well as try to clarify anything
>> I've left out or misrepresented above.  Another benefit to this
>> mechanism is that direct GPU assignment and vGPU assignment use the same
>> code within QEMU and same API to the kernel, which should make debugging
>> and code support between the two easier.  I'd really like to start a
>> discussion around this proposal, and of course the first open source
>> implementation of this sort of model will really help to drive the
>> direction it takes.  Thanks!
>>
>
> Thanks for starting this discussion. Intel will definitely work with the
> community on this. Based on the earlier comments, I'm not sure whether we
> can use exactly the same code for direct GPU assignment and vGPU
> assignment, since even if we extend VFIO, some interfaces might be
> vGPU-specific. Does this still achieve your end goal?
>
> Thanks
> Kevin
>
WARNING: multiple messages have this Message-ID (diff)
From: Jike Song <jike.song@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "igvt-g@ml01.01.org" <igvt-g@ml01.01.org>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"Reddy, Raghuveer" <raghuveer.reddy@intel.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"White, Michael L" <michael.l.white@intel.com>,
	"Cowperthwaite, David J" <david.j.cowperthwaite@intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>,
	"Li, Susie" <susie.li@intel.com>,
	"Dong, Eddie" <eddie.dong@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"Zhou, Chao" <chao.zhou@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Zhu, Libo" <libo.zhu@intel.com>,
	"Wang, Hongbo" <hongbo.wang@intel.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>
Subject: Re: [Qemu-devel] [Intel-gfx] [Announcement] 2015-Q3 release of XenGT - a Mediated Graphics Passthrough Solution from Intel
Date: Thu, 19 Nov 2015 15:22:56 +0800	[thread overview]
Message-ID: <564D78D0.80904@intel.com> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F7152DB@SHSMSX101.ccr.corp.intel.com>

Hi Alex,
On 11/19/2015 12:06 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Thursday, November 19, 2015 2:12 AM
>>
>> [cc +qemu-devel, +paolo, +gerd]
>>
>> On Tue, 2015-10-27 at 17:25 +0800, Jike Song wrote:
>>> {snip}
>>
>> Hi!
>>
>> At redhat we've been thinking about how to support vGPUs from multiple
>> vendors in a common way within QEMU.  We want to enable code sharing
>> between vendors and give new vendors an easy path to add their own
>> support.  We also have the complication that not all vGPU vendors are as
>> open source friendly as Intel, so being able to abstract the device
>> mediation and access outside of QEMU is a big advantage.
>>
>> The proposal I'd like to make is that a vGPU, whether it is from Intel
>> or another vendor, is predominantly a PCI(e) device.  We have an
>> interface in QEMU already for exposing arbitrary PCI devices, vfio-pci.
>> Currently vfio-pci uses the VFIO API to interact with "physical" devices
>> and system IOMMUs.  I highlight /physical/ there because some of these
>> physical devices are SR-IOV VFs, which is somewhat of a fuzzy concept,
>> somewhere between fixed hardware and a virtual device implemented in
>> software.  That software just happens to be running on the physical
>> endpoint.
>
> Agree.
>
> One clarification for rest discussion, is that we're talking about GVT-g vGPU
> here which is a pure software GPU virtualization technique. GVT-d (note
> some use in the text) refers to passing through the whole GPU or a specific
> VF. GVT-d already falls into existing VFIO APIs nicely (though some on-going
> effort to remove Intel specific platform stickness from gfx driver). :-)
>

Hi Alex, thanks for the discussion.

In addition to Kevin's replies, I have a high-level question: can VFIO
be used by QEMU for both KVM and Xen?

--
Thanks,
Jike

  
>>
>> vGPUs are similar, with the virtual device created at a different point,
>> host software.  They also rely on different IOMMU constructs, making use
>> of the MMU capabilities of the GPU (GTTs and such), but really having
>> similar requirements.
>
> One important difference between system IOMMU and GPU-MMU here.
> System IOMMU is very much about translation from a DMA target
> (IOVA on native, or GPA in virtualization case) to HPA. However GPU
> internal MMUs is to translate from Graphics Memory Address (GMA)
> to DMA target (HPA if system IOMMU is disabled, or IOVA/GPA if system
> IOMMU is enabled). GMA is an internal addr space within GPU, not
> exposed to Qemu and fully managed by GVT-g device model. Since it's
> not a standard PCI defined resource, we don't need abstract this capability
> in VFIO interface.
>
>>
>> The proposal is therefore that GPU vendors can expose vGPUs to
>> userspace, and thus to QEMU, using the VFIO API.  For instance, vfio
>> supports modular bus drivers and IOMMU drivers.  An intel-vfio-gvt-d
>> module (or extension of i915) can register as a vfio bus driver, create
>> a struct device per vGPU, create an IOMMU group for that device, and
>> register that device with the vfio-core.  Since we don't rely on the
>> system IOMMU for GVT-d vGPU assignment, another vGPU vendor driver (or
>> extension of the same module) can register a "type1" compliant IOMMU
>> driver into vfio-core.  From the perspective of QEMU then, all of the
>> existing vfio-pci code is re-used, QEMU remains largely unaware of any
>> specifics of the vGPU being assigned, and the only necessary change so
>> far is how QEMU traverses sysfs to find the device and thus the IOMMU
>> group leading to the vfio group.
>
> GVT-g requires to pin guest memory and query GPA->HPA information,
> upon which shadow GTTs will be updated accordingly from (GMA->GPA)
> to (GMA->HPA). So yes, here a dummy or simple "type1" compliant IOMMU
> can be introduced just for this requirement.
>
> However there's one tricky point which I'm not sure whether overall
> VFIO concept will be violated. GVT-g doesn't require system IOMMU
> to function, however host system may enable system IOMMU just for
> hardening purpose. This means two-level translations existing (GMA->
> IOVA->HPA), so the dummy IOMMU driver has to request system IOMMU
> driver to allocate IOVA for VMs and then setup IOVA->HPA mapping
> in IOMMU page table. In this case, multiple VM's translations are
> multiplexed in one IOMMU page table.
>
> We might need create some group/sub-group or parent/child concepts
> among those IOMMUs for thorough permission control.
>
>>
>> There are a few areas where we know we'll need to extend the VFIO API to
>> make this work, but it seems like they can all be done generically.  One
>> is that PCI BARs are described through the VFIO API as regions and each
>> region has a single flag describing whether mmap (ie. direct mapping) of
>> that region is possible.  We expect that vGPUs likely need finer
>> granularity, enabling some areas within a BAR to be trapped and fowarded
>> as a read or write access for the vGPU-vfio-device module to emulate,
>> while other regions, like framebuffers or texture regions, are directly
>> mapped.  I have prototype code to enable this already.
>
> Yes in GVT-g one BAR resource might be partitioned among multiple vGPUs.
> If VFIO can support such partial resource assignment, it'd be great. Similar
> parent/child concept might also be required here, so any resource enumerated
> on a vGPU shouldn't break limitations enforced on the physical device.
>
> One unique requirement for GVT-g here, though, is that vGPU device model
> need to know guest BAR configuration for proper emulation (e.g. register
> IO emulation handler to KVM). Similar is about guest MSI vector for virtual
> interrupt injection. Not sure how this can be fit into common VFIO model.
> Does VFIO allow vendor specific extension today?
>
>>
>> Another area is that we really don't want to proliferate each vGPU
>> needing a new IOMMU type within vfio.  The existing type1 IOMMU provides
>> potentially the most simple mapping and unmapping interface possible.
>> We'd therefore need to allow multiple "type1" IOMMU drivers for vfio,
>> making type1 be more of an interface specification rather than a single
>> implementation.  This is a trivial change to make within vfio and one
>> that I believe is compatible with the existing API.  Note that
>> implementing a type1-compliant vfio IOMMU does not imply pinning an
>> mapping every registered page.  A vGPU, with mediated device access, may
>> use this only to track the current HVA to GPA mappings for a VM.  Only
>> when a DMA is enabled for the vGPU instance is that HVA pinned and an
>> HPA to GPA translation programmed into the GPU MMU.
>>
>> Another area of extension is how to expose a framebuffer to QEMU for
>> seamless integration into a SPICE/VNC channel.  For this I believe we
>> could use a new region, much like we've done to expose VGA access
>> through a vfio device file descriptor.  An area within this new
>> framebuffer region could be directly mappable in QEMU while a
>> non-mappable page, at a standard location with standardized format,
>> provides a description of framebuffer and potentially even a
>> communication channel to synchronize framebuffer captures.  This would
>> be new code for QEMU, but something we could share among all vGPU
>> implementations.
>
> Now GVT-g already provides an interface to decode framebuffer information,
> w/ an assumption that the framebuffer will be further composited into
> OpenGL APIs. So the format is defined according to OpenGL definition.
> Does that meet SPICE requirement?
>
> Another thing to be added. Framebuffers are frequently switched in
> reality. So either Qemu needs to poll or a notification mechanism is required.
> And since it's dynamic, having framebuffer page directly exposed in the
> new region might be tricky. We can just expose framebuffer information
> (including base, format, etc.) and let Qemu to map separately out of VFIO
> interface.
>
> And... this works fine with vGPU model since software knows all the
> detail about framebuffer. However in pass-through case, who do you expect
> to provide that information? Is it OK to introduce vGPU specific APIs in
> VFIO?
>
>>
>> Another obvious area to be standardized would be how to discover,
>> create, and destroy vGPU instances.  SR-IOV has a standard mechanism to
>> create VFs in sysfs and I would propose that vGPU vendors try to
>> standardize on similar interfaces to enable libvirt to easily discover
>> the vGPU capabilities of a given GPU and manage the lifecycle of a vGPU
>> instance.
>
> There is no standard today. We expose vGPU life-cycle management APIs
> through sysfs (under the i915 node), which is very Intel specific. In
> reality, different vendors have quite different capabilities for their
> own vGPUs, so I'm not sure how standard such a mechanism can be. But
> this code should be minor and easy to maintain in libvirt.
>
>>
>> This is obviously a lot to digest, but I'd certainly be interested in
>> hearing feedback on this proposal as well as try to clarify anything
>> I've left out or misrepresented above.  Another benefit to this
>> mechanism is that direct GPU assignment and vGPU assignment use the same
>> code within QEMU and same API to the kernel, which should make debugging
>> and code support between the two easier.  I'd really like to start a
>> discussion around this proposal, and of course the first open source
>> implementation of this sort of model will really help to drive the
>> direction it takes.  Thanks!
>>
>
> Thanks for starting this discussion. Intel will definitely work with
> the community on this. Based on the earlier comments, I'm not sure
> whether we can use exactly the same code for direct GPU assignment and
> vGPU assignment, since even if we extend VFIO, some interfaces might
> be vGPU specific. Does this approach still achieve your end goal?
>
> Thanks
> Kevin
>

