kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jike Song <jike.song@intel.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"kraxel@redhat.com" <kraxel@redhat.com>,
	"cjia@nvidia.com" <cjia@nvidia.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"bjsdjshi@linux.vnet.ibm.com" <bjsdjshi@linux.vnet.ibm.com>,
	"libvir-list@redhat.com" <libvir-list@redhat.com>,
	"Daniel P. Berrange" <berrange@redhat.com>,
	Laine Stump <laine@redhat.com>
Subject: Re: [PATCH v7 0/4] Add Mediated device support
Date: Wed, 31 Aug 2016 15:04:13 +0800	[thread overview]
Message-ID: <57C6816D.2010201@intel.com> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D18DEBD970@SHSMSX101.ccr.corp.intel.com>

On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Wednesday, August 31, 2016 12:17 AM
>>
>> Hi folks,
>>
>> At KVM Forum we had a BoF session primarily around the mediated device
>> sysfs interface.  I'd like to share what I think we agreed on and the
>> "problem areas" that still need some work so we can get the thoughts
>> and ideas from those who weren't able to attend.
>>
>> DanPB expressed some concern about the mdev_supported_types sysfs
>> interface, which exposes a flat csv file with fields like "type",
>> "number of instance", "vendor string", and then a bunch of type
>> specific fields like "framebuffer size", "resolution", "frame rate
>> limit", etc.  This is not entirely machine parsing friendly and sort of
>> abuses the sysfs concept of one value per file.  Example output taken
>> from Neo's libvirt RFC:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>> max_resolution
>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>>
>> The create/destroy then looks like this:
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_create
>>
>> echo "$mdev_UUID:vendor_specific_argument_list" >
>> 	/sys/bus/pci/devices/.../mdev_destroy
>>
>> "vendor_specific_argument_list" is nebulous.
>>
>> So the idea to fix this is to explode this into a directory structure,
>> something like:
>>
>> ├── mdev_destroy
>> └── mdev_supported_types
>>     ├── 11
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     ├── 12
>>     │   ├── create
>>     │   ├── description
>>     │   └── max_instances
>>     └── 13
>>         ├── create
>>         ├── description
>>         └── max_instances
>>
>> Note that I'm only exposing the minimal attributes here for simplicity,
>> the other attributes would be included in separate files and we would
>> require vendors to create standard attributes for common device classes.
> 
> I like this idea. All standard attributes are reflected into this hierarchy.
> In the meantime, can we still allow optional vendor string in create 
> interface? libvirt doesn't need to know the meaning, but allows upper
> layer to do some vendor specific tweak if necessary.
> 

Not sure whether this can done within MDEV framework (attrs provided by
vendor driver of course), or must be within the vendor driver.

>>
>> For vGPUs like NVIDIA where we don't support multiple types
>> concurrently, this directory structure would update as mdev devices are
>> created, removing no longer available types.  I carried forward
> 
> or keep the type with max_instances cleared to ZERO.
>

+1 :)

>> max_instances here, but perhaps we really want to copy SR-IOV and
>> report a max and current allocation.  Creation and deletion is
> 
> right, cur/max_instances look reasonable.
> 
>> simplified as we can simply "echo $UUID > create" per type.  I don't
>> understand why destroy had a parameter list, so here I imagine we can
>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>> entry under each mdev device, so we remove it at the device rather than
>> in some central location (any objections?).
> 
> OK to me. 

IIUC, "destroy" has a parameter list is only because the previous
$VM_UUID + instnace implementation. It should be safe to move the "destroy"
file under mdev now.

>> We discussed how this might look with Intel devices which do allow
>> mixed vGPU types concurrently.  We believe, but need confirmation, that
>> the vendor driver could still make a finite set of supported types,
>> perhaps with additional module options to the vendor driver to enable
>> more "exotic" types.  So for instance if IGD vGPUs are based on
>> power-of-2 portions of the framebuffer size, then the vendor driver
>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.
>
> Yes, Intel can do such type of definition. One thing I'm not sure is 
> about impact cross listed types, i.e. when creating a new instance
> under a given type, max_instances under other types would be 
> dynamically decremented based on available resource. Would it be
> a problem for libvirt or upper level stack, since a natural interpretation
> of max_instances should be a static number?
>
> An alternative is to make max_instances configurable, so libvirt has
> chance to define a pool of available instances with different types
> before creating any instance. For example, initially IGD driver may 
> report max_instances only for a minimal sharing granularity:
> 	128MB:
> 		max_instances (8)
> 	256MB:
> 		max_instances (0)
> 	512MB:
> 		max_instances (0)
> 
> Then libvirt can configure more types as:
> 	128MB:
> 		max_instances (2)
> 	256MB:
> 		max_instances (1)
> 	512MB:
> 		max_instances (1)
> 
> Starting from this point, max_instances would be static and then
> mdev instance can be created under each type. But I'm not
> sure whether such additional configuration role is reasonable to libvirt...
>>
>> We still don't have any way for the admin to learn in advance how the
>> available supported types will change once mdev devices start to be
>> created.  I'm not sure how we can create a specification for this, so
>> probing by creating devices may be the most flexible model.
>>
>> The other issue is the start/stop requirement, which was revealed to
>> setup peer-to-peer resources between vGPUs which is a limited hardware
>> resource.  We'd really like to have these happen automatically on the
>> first open of a vfio mdev device file and final release.  So we
>> brainstormed how the open/release callbacks could know the other mdev
>> devices for a given user.  This is where the instance number came into
>> play previously.  This is an area that needs work.
> 
> IGD doesn't have such peer-to-peer resource setup requirement. So
> it's sufficient to create/destroy a mdev instance in a single action on
> IGD. However I'd expect we still keep the "start/stop" interface (
> maybe not exposed as sysfs node, instead being a VFIO API), as 
> required to support future live migration usage. We've made prototype
> working for KVMGT today.

It's good for the framework to define start/stop interfaces, but as Alex
said below, it should be MDEV oriented, not VM oriented.

I don't know a lot about the peer-to-peer resource, but to me, although
VM_UUID + instance is not applicable, userspace can always achieve the
same purpose by, let us assume a mdev hierarchy, providing the VM UUID
under every mdev:

	/sys/bus/pci/devices/<sbdf>/mdev/
	|-- mdev01/
	|   `-- vm_uuid
	`-- mdev02/
	    `-- vm_uuid

Did I miss something?

>>
>> There was a thought that perhaps on open() the vendor driver could look
>> at the user pid and use that to associate with other devices, but the
>> problem here is that we open and begin access to each device, so
>> devices do this discovery serially rather than in parallel as desired.
>> (we might not fault in mmio space yet though, so I wonder if open()
>> could set the association of mdev to pid, then the first mmio fault
>> would trigger the resource allocation?  Then all the "magic" would live
>> in the vendor driver.  open() could fail if the pid already has running
>> mdev devices and the vendor driver chooses not to support hotplug)
>>
>> One comment was that for a GPU that only supports homogeneous vGPUs,
>> libvirt may choose to create all the vGPUs in advance and handle them
>> as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
>> case.
>>
>> We also considered whether iommu groups could be (ab)used for this use
>> case, peer-to-peer would in fact be an iommu grouping constraint
>> afterall.  This would have the same UUID+instance constraint as above
>> though and would require some sort of sysfs interface for the user to
>> be able to create multiple mdevs within a group.
>>
>> Everyone was given homework to think about this on their flights home,
>> so I expect plenty of ideas by now ;)
>>
>> Overall I think mediated devices were well received by the community,
>> so let's keep up the development and discussion to bring it to
>> fruition.  Thanks,
> 
> Thanks a lot Alex for your help on driving this discussion. Mediated device
> technique has the potential to be used for other type of I/O virtualizations
> in the future, not limited to GPU virtualization. So getting the core framework
> ready earlier would be highly welcomed. :-)
> 
--
Thanks,
Jike


  reply	other threads:[~2016-08-31  7:06 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-25  3:53 [PATCH v7 0/4] Add Mediated device support Kirti Wankhede
2016-08-25  3:53 ` [PATCH v7 1/4] vfio: Mediated device Core driver Kirti Wankhede
2016-09-08  8:09   ` Jike Song
2016-09-08  9:38     ` Neo Jia
2016-09-09  6:26       ` Jike Song
2016-09-09 17:48     ` Kirti Wankhede
2016-09-09 18:42       ` Alex Williamson
2016-09-09 19:55         ` Kirti Wankhede
2016-09-12  5:10           ` Jike Song
2016-09-12  7:49             ` Kirti Wankhede
2016-09-12 15:53               ` Alex Williamson
2016-09-19  7:08                 ` Jike Song
2016-09-19 17:29                 ` Kirti Wankhede
2016-09-19 18:11                   ` Alex Williamson
2016-09-19 20:09                     ` Kirti Wankhede
2016-09-19 20:59                       ` [Qemu-devel] " Alex Williamson
2016-09-20 12:48   ` Jike Song
2016-08-25  3:53 ` [PATCH v7 2/4] vfio: VFIO driver for mediated devices Kirti Wankhede
2016-08-25  9:22   ` Dong Jia
2016-08-26 14:13     ` Kirti Wankhede
2016-09-08  2:38       ` Jike Song
2016-09-19 18:22       ` [Qemu-devel] " Kirti Wankhede
2016-09-19 18:36         ` Alex Williamson
2016-09-19 19:13           ` Kirti Wankhede
2016-09-19 20:03             ` Alex Williamson
2016-09-20  2:50               ` Jike Song
2016-09-20 16:24                 ` Alex Williamson
2016-09-21  3:19                   ` Jike Song
2016-09-21  4:51                     ` Alex Williamson
2016-09-21  5:02                       ` Jike Song
2016-09-08  2:45     ` Jike Song
2016-09-13  2:35       ` Jike Song
2016-09-20  5:48         ` Dong Jia Shi
     [not found]         ` <20160920054851.GA2186@bjsdjshi@linux.vnet.ibm.com>
2016-09-20  6:37           ` Jike Song
2016-09-20 12:53   ` Jike Song
2016-08-25  3:53 ` [PATCH v7 3/4] vfio iommu: Add support " Kirti Wankhede
2016-08-25  7:29   ` Dong Jia
2016-08-26 13:50     ` Kirti Wankhede
2016-09-29  2:17   ` Jike Song
2016-09-29 15:06     ` Kirti Wankhede
2016-09-30  2:58       ` Jike Song
2016-09-30  3:10         ` Jike Song
2016-09-30 11:44           ` Kirti Wankhede
2016-10-08  7:09             ` Jike Song
2016-08-25  3:53 ` [PATCH v7 4/4] docs: Add Documentation for Mediated devices Kirti Wankhede
2016-09-03 16:40   ` Kirti Wankhede
2016-08-30 16:16 ` [PATCH v7 0/4] Add Mediated device support Alex Williamson
2016-08-31  6:12   ` Tian, Kevin
2016-08-31  7:04     ` Jike Song [this message]
2016-08-31 15:48       ` Alex Williamson
2016-09-01  4:09         ` Tian, Kevin
2016-09-01  4:10         ` Tian, Kevin
2016-09-01 18:22         ` Kirti Wankhede
2016-09-01 20:01           ` Alex Williamson
2016-09-02  6:17             ` Kirti Wankhede
2016-09-01 16:47     ` [Qemu-devel] " Michal Privoznik
2016-09-01 16:59       ` Alex Williamson
2016-09-02  4:48         ` [Qemu-devel] " Michal Privoznik
2016-09-02  5:21           ` Kirti Wankhede
2016-09-02 10:05             ` Paolo Bonzini
2016-09-02 17:15               ` Kirti Wankhede
2016-09-02 17:25                 ` Paolo Bonzini
2016-09-02 18:33                   ` Kirti Wankhede
2016-09-02 20:29                     ` [libvirt] " John Ferlan
2016-09-03 16:31                       ` [libvirt] " Kirti Wankhede
2016-09-06 17:54                         ` [libvirt] [Qemu-devel] " Alex Williamson
2016-09-02 21:48                     ` Paolo Bonzini
2016-09-03 11:56                       ` [libvirt] " John Ferlan
2016-09-03 13:07                         ` Paolo Bonzini
2016-09-03 17:47                           ` [libvirt] " Kirti Wankhede
2016-09-03 16:34                       ` [Qemu-devel] " Kirti Wankhede
2016-09-06 17:40                         ` Alex Williamson
2016-09-06 19:35                           ` Kirti Wankhede
2016-09-06 21:28                             ` Alex Williamson
2016-09-07  8:22                               ` Tian, Kevin
2016-09-07 16:00                                 ` Alex Williamson
2016-09-07 16:15                               ` Kirti Wankhede
2016-09-07 16:44                                 ` Alex Williamson
2016-09-07 18:06                                   ` Kirti Wankhede
2016-09-07 22:13                                     ` Alex Williamson
2016-09-08 18:48                                       ` Kirti Wankhede
2016-09-08 20:51                                         ` Alex Williamson
2016-09-07 18:17                                   ` Neo Jia
2016-09-07 18:27                                     ` Daniel P. Berrange
2016-09-07 18:32                                       ` Neo Jia
2016-09-07  6:48                           ` Tian, Kevin
2016-09-02 20:19               ` [libvirt] " John Ferlan
2016-09-02 21:44                 ` Paolo Bonzini
2016-09-02 23:57                   ` Laine Stump
2016-09-03 16:49                     ` Kirti Wankhede
2016-09-05  7:52                     ` Paolo Bonzini
2016-09-03 11:57                   ` John Ferlan
2016-09-05  7:54                     ` Paolo Bonzini
2016-09-02 17:55         ` Laine Stump
2016-09-02 19:15           ` [libvirt] " Alex Williamson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57C6816D.2010201@intel.com \
    --to=jike.song@intel.com \
    --cc=alex.williamson@redhat.com \
    --cc=berrange@redhat.com \
    --cc=bjsdjshi@linux.vnet.ibm.com \
    --cc=cjia@nvidia.com \
    --cc=kevin.tian@intel.com \
    --cc=kraxel@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=laine@redhat.com \
    --cc=libvir-list@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).