From: Jike Song <jike.song@intel.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"kraxel@redhat.com" <kraxel@redhat.com>,
"cjia@nvidia.com" <cjia@nvidia.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"bjsdjshi@linux.vnet.ibm.com" <bjsdjshi@linux.vnet.ibm.com>,
"libvir-list@redhat.com" <libvir-list@redhat.com>,
"Daniel P. Berrange" <berrange@redhat.com>,
Laine Stump <laine@redhat.com>
Subject: Re: [PATCH v7 0/4] Add Mediated device support
Date: Wed, 31 Aug 2016 15:04:13 +0800
Message-ID: <57C6816D.2010201@intel.com>
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D18DEBD970@SHSMSX101.ccr.corp.intel.com>
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Wednesday, August 31, 2016 12:17 AM
>>
>> Hi folks,
>>
>> At KVM Forum we had a BoF session primarily around the mediated device
>> sysfs interface. I'd like to share what I think we agreed on and the
>> "problem areas" that still need some work so we can get the thoughts
>> and ideas from those who weren't able to attend.
>>
>> DanPB expressed some concern about the mdev_supported_types sysfs
>> interface, which exposes a flat csv file with fields like "type",
>> "number of instance", "vendor string", and then a bunch of type
>> specific fields like "framebuffer size", "resolution", "frame rate
>> limit", etc. This is not entirely machine parsing friendly and sort of
>> abuses the sysfs concept of one value per file. Example output taken
>> from Neo's libvirt RFC:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
>> 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
>> 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
>> 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
>> 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
>> 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
>> 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
>> 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
>> 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
>>
>> The create/destroy then looks like this:
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
>>
>> "vendor_specific_argument_list" is nebulous.
>>
>> So the idea to fix this is to explode this into a directory structure,
>> something like:
>>
>> ├── mdev_destroy
>> └── mdev_supported_types
>>     ├── 11
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     ├── 12
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     └── 13
>>         ├── create
>>         ├── description
>>         └── max_instances
>>
>> Note that I'm only exposing the minimal attributes here for simplicity,
>> the other attributes would be included in separate files and we would
>> require vendors to create standard attributes for common device classes.
>
> I like this idea. All standard attributes are reflected into this hierarchy.
> In the meantime, can we still allow optional vendor string in create
> interface? libvirt doesn't need to know the meaning, but allows upper
> layer to do some vendor specific tweak if necessary.
>
Not sure whether this can be done within the MDEV framework (with attrs
provided by the vendor driver, of course), or must be done within the
vendor driver.
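
To make the proposed hierarchy a bit more concrete, here is a rough sketch
of how a tool might enumerate it. It assumes the layout from Alex's mail
(one directory per type id under mdev_supported_types, each holding
description and max_instances). The function name list_mdev_types and
passing the parent device directory as an argument are only for
illustration, not an existing interface.

```shell
# Sketch only: walk the proposed mdev_supported_types directory and print
# "<type id> <description> <max_instances>", tab-separated, one type per line.
# list_mdev_types is a hypothetical helper; $1 is the parent device
# directory (on real hardware something like /sys/bus/pci/devices/<sbdf>).
list_mdev_types() {
    parent=$1
    for dir in "$parent"/mdev_supported_types/*/; do
        [ -d "$dir" ] || continue          # glob matched nothing
        id=$(basename "$dir")
        desc=$(cat "$dir/description")
        max=$(cat "$dir/max_instances")
        printf '%s\t%s\t%s\n' "$id" "$desc" "$max"
    done
}
```

Since every attribute is one value per file, nothing needs to be split or
unquoted the way the flat csv would require.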
>>
>> For vGPUs like NVIDIA where we don't support multiple types
>> concurrently, this directory structure would update as mdev devices are
>> created, removing no longer available types. I carried forward
>
> or keep the type with max_instances cleared to ZERO.
>
+1 :)
>> max_instances here, but perhaps we really want to copy SR-IOV and
>> report a max and current allocation. Creation and deletion is
>
> right, cur/max_instances look reasonable.
>
>> simplified as we can simply "echo $UUID > create" per type. I don't
>> understand why destroy had a parameter list, so here I imagine we can
>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>> entry under each mdev device, so we remove it at the device rather than
>> in some central location (any objections?).
>
> OK to me.
IIUC, "destroy" has a parameter list only because of the previous
$VM_UUID + instance implementation. It should be safe to move the "destroy"
file under each mdev now.
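
With that, creation and removal could look roughly like the sketch below.
The per-type "create" file is from Alex's proposal and the per-device
"remove" file is what he suggests, but where the mdev device directory
lives and the value written to "remove" ("1" here) are my assumptions, as
are the helper names.

```shell
# Sketch only: per-type "create" and per-device "remove", as discussed above.
# mdev_create and mdev_remove are hypothetical helper names.

# $1: parent device directory, $2: type id, $3: UUID for the new mdev
mdev_create() {
    echo "$3" > "$1/mdev_supported_types/$2/create"
}

# $1: directory of the mdev device itself. Writing "1" is an assumption;
# the thread does not define what value "remove" expects.
mdev_remove() {
    echo 1 > "$1/remove"
}
```

Neither call needs a vendor specific argument list: the type is selected
by the directory, and the device to remove by its own path.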
>> We discussed how this might look with Intel devices which do allow
>> mixed vGPU types concurrently. We believe, but need confirmation, that
>> the vendor driver could still make a finite set of supported types,
>> perhaps with additional module options to the vendor driver to enable
>> more "exotic" types. So for instance if IGD vGPUs are based on
>> power-of-2 portions of the framebuffer size, then the vendor driver
>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>> sizes. As vGPUs are allocated, the larger sizes may become unavailable.
>
> Yes, Intel can do such type of definition. One thing I'm not sure is
> about impact cross listed types, i.e. when creating a new instance
> under a given type, max_instances under other types would be
> dynamically decremented based on available resource. Would it be
> a problem for libvirt or upper level stack, since a natural interpretation
> of max_instances should be a static number?
>
> An alternative is to make max_instances configurable, so libvirt has
> chance to define a pool of available instances with different types
> before creating any instance. For example, initially IGD driver may
> report max_instances only for a minimal sharing granularity:
>     128MB:
>         max_instances (8)
>     256MB:
>         max_instances (0)
>     512MB:
>         max_instances (0)
>
> Then libvirt can configure more types as:
>     128MB:
>         max_instances (2)
>     256MB:
>         max_instances (1)
>     512MB:
>         max_instances (1)
>
> Starting from this point, max_instances would be static and then
> mdev instance can be created under each type. But I'm not
> sure whether such additional configuration role is reasonable to libvirt...
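
To put numbers on the two models: with a dynamic max_instances, each type
would report however many instances still fit into the remaining resource,
while with the configurable pool the configured counts must fit into the
total up front. A rough sketch of both checks follows. The 1024MB total is
only inferred from the "128MB: max_instances (8)" example above, and the
function names and the "size:count" notation are made up for illustration.

```shell
# Sketch only; sizes are in MB. Both helper names are hypothetical.

# Dynamic model: how many instances of a given size still fit.
# $1: total resource, $2: already allocated, $3: size of the type
remaining_instances() {
    echo $(( ($1 - $2) / $3 ))
}

# Configurable model: does a planned pool fit into the total?
# $1: total resource, remaining args: "size:count" pairs
pool_fits() {
    total=$1
    shift
    used=0
    for pair in "$@"; do
        size=${pair%%:*}
        count=${pair#*:}
        used=$(( used + size * count ))
    done
    [ "$used" -le "$total" ]
}
```

For the second table above, "pool_fits 1024 128:2 256:1 512:1" succeeds,
since 2*128 + 256 + 512 = 1024.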
>>
>> We still don't have any way for the admin to learn in advance how the
>> available supported types will change once mdev devices start to be
>> created. I'm not sure how we can create a specification for this, so
>> probing by creating devices may be the most flexible model.
>>
>> The other issue is the start/stop requirement, which was revealed to
>> setup peer-to-peer resources between vGPUs which is a limited hardware
>> resource. We'd really like to have these happen automatically on the
>> first open of a vfio mdev device file and final release. So we
>> brainstormed how the open/release callbacks could know the other mdev
>> devices for a given user. This is where the instance number came into
>> play previously. This is an area that needs work.
>
> IGD doesn't have such peer-to-peer resource setup requirement. So
> it's sufficient to create/destroy a mdev instance in a single action on
> IGD. However I'd expect we still keep the "start/stop" interface (
> maybe not exposed as sysfs node, instead being a VFIO API), as
> required to support future live migration usage. We've made prototype
> working for KVMGT today.
It's good for the framework to define start/stop interfaces, but as Alex
said below, they should be MDEV oriented, not VM oriented.

I don't know much about the peer-to-peer resource, but it seems to me that
even though VM_UUID + instance is no longer applicable, userspace can still
achieve the same purpose by providing the VM UUID under every mdev. Assuming
an mdev hierarchy like this:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid
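
With such a layout, finding all mdevs that belong to one VM is a simple
walk. A rough sketch, assuming the hypothetical mdev/<name>/vm_uuid files
shown above (the helper name is made up too):

```shell
# Sketch only: print the names of all mdevs under <parent>/mdev/ whose
# vm_uuid file matches the given VM UUID. mdevs_for_vm is a hypothetical
# helper, and the mdev/<name>/vm_uuid layout is just the one assumed above.
# $1: parent device directory, $2: VM UUID
mdevs_for_vm() {
    for f in "$1"/mdev/*/vm_uuid; do
        [ -f "$f" ] || continue            # glob matched nothing
        if [ "$(cat "$f")" = "$2" ]; then
            basename "$(dirname "$f")"
        fi
    done
}
```
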
Did I miss something?
>>
>> There was a thought that perhaps on open() the vendor driver could look
>> at the user pid and use that to associate with other devices, but the
>> problem here is that we open and begin access to each device, so
>> devices do this discovery serially rather than in parallel as desired.
>> (we might not fault in mmio space yet though, so I wonder if open()
>> could set the association of mdev to pid, then the first mmio fault
>> would trigger the resource allocation? Then all the "magic" would live
>> in the vendor driver. open() could fail if the pid already has running
>> mdev devices and the vendor driver chooses not to support hotplug)
>>
>> One comment was that for a GPU that only supports homogeneous vGPUs,
>> libvirt may choose to create all the vGPUs in advance and handle them
>> as we do SR-IOV VFs. The UUID+instance model would preclude such a use
>> case.
>>
>> We also considered whether iommu groups could be (ab)used for this use
>> case, peer-to-peer would in fact be an iommu grouping constraint
>> afterall. This would have the same UUID+instance constraint as above
>> though and would require some sort of sysfs interface for the user to
>> be able to create multiple mdevs within a group.
>>
>> Everyone was given homework to think about this on their flights home,
>> so I expect plenty of ideas by now ;)
>>
>> Overall I think mediated devices were well received by the community,
>> so let's keep up the development and discussion to bring it to
>> fruition. Thanks,
>
> Thanks a lot Alex for your help on driving this discussion. Mediated device
> technique has the potential to be used for other type of I/O virtualizations
> in the future, not limited to GPU virtualization. So getting the core framework
> ready earlier would be highly welcomed. :-)
>
--
Thanks,
Jike