From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:52193)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwankhede@nvidia.com>) id 1bgDrB-0007Ht-Af
	for qemu-devel@nongnu.org; Sat, 03 Sep 2016 12:31:50 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kwankhede@nvidia.com>) id 1bgDr7-0005nI-5q
	for qemu-devel@nongnu.org; Sat, 03 Sep 2016 12:31:48 -0400
Received: from hqemgate14.nvidia.com ([216.228.121.143]:17967)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwankhede@nvidia.com>) id 1bgDr6-0005nC-Qs
	for qemu-devel@nongnu.org; Sat, 03 Sep 2016 12:31:45 -0400
References: <1472097235-6332-1-git-send-email-kwankhede@nvidia.com>
	<20160830101638.49df467d@t450s.home>
	<AADFC41AFE54684AB9EE6CBC0274A5D18DEBD970@SHSMSX101.ccr.corp.intel.com>
	<78fedd65-6d62-e849-ff3b-d5105b2da816@redhat.com>
	<20160901105948.62f750aa@t450s.home>
	<b4f7b36a-7bca-ef04-e6f5-8b817700250c@redhat.com>
	<98bbdbbf-c388-9120-3306-64f0cfb820a7@nvidia.com>
	<8682faeb-0331-f014-c13e-03c20f3f2bdf@redhat.com>
	<a5476012-0d17-d7bc-b63e-e361b5d7d34e@nvidia.com>
	<f3da1991-f8aa-e330-4b80-509e4f13896b@redhat.com>
	<fde200db-85d2-d979-1f8c-85dada4d6efe@nvidia.com>
	<9863c9f8-77fd-61e8-708c-a6747dcd64ea@redhat.com>
From: Kirti Wankhede <kwankhede@nvidia.com>
Message-ID: <d2dee16c-1c69-159f-dac8-42ff90ad15cd@nvidia.com>
Date: Sat, 3 Sep 2016 22:01:13 +0530
MIME-Version: 1.0
In-Reply-To: <9863c9f8-77fd-61e8-708c-a6747dcd64ea@redhat.com>
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [libvirt] [PATCH v7 0/4] Add Mediated device
 support
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: John Ferlan <jferlan@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Michal Privoznik <mprivozn@redhat.com>, Alex Williamson <alex.williamson@redhat.com>
Cc: "Song, Jike" <jike.song@intel.com>, "cjia@nvidia.com" <cjia@nvidia.com>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, "libvir-list@redhat.com" <libvir-list@redhat.com>, "Tian, Kevin" <kevin.tian@intel.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "kraxel@redhat.com" <kraxel@redhat.com>, Laine Stump <laine@redhat.com>, "bjsdjshi@linux.vnet.ibm.com" <bjsdjshi@linux.vnet.ibm.com>



On 9/3/2016 1:59 AM, John Ferlan wrote:
> 
> 
> On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
>>
>> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <type id='11'/>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> After creating the vGPU, if required by the host driver, all the other
>>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>>
>>>> Thanks Paolo for details.
>>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>>>> file in sysfs to create mdev device. Right?
>>>> At this moment, does libvirt know which VM this device would be
>>>> associated with?
>>>
>>> No, the VM will associate to the nodedev through the UUID.  The nodedev
>>> is created separately from the VM.
>>>
>>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>>> info, again taken from sysfs:
>>>>>
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>        <!-- only the chosen type -->
>>>>>        <type id='11'>
>>>>>          <!-- ... snip ... -->
>>>>>        </type>
>>>>>        <capability type='pci'>
>>>>>          <!-- no domain/bus/slot/function of course -->
>>>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>>>>          <product id='...'>...</product>
>>>>>          <vendor id='0x10de'>NVIDIA</vendor>
>>>>>        </capability>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>>> pci at all, would have it inside mdev.  This represents the difference
>>>>> between the mdev provider and the mdev device.
>>>>
>>>> Parent of mdev device might not always be a PCI device. I think we
>>>> shouldn't consider it as PCI capability.
>>>
>>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
>>> as a PCI device by VFIO.
>>>
>>> The <capability type='pci'> in the physical GPU means that the GPU is a
>>> PCI device.
>>>
>>
>> Ok. Got that.
>>
>>>>> Random proposal for the domain XML too:
>>>>>
>>>>>   <hostdev mode='subsystem' type='pci'>
>>>>>     <source type='mdev'>
>>>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>     </source>
>>>>>     <address type='pci' bus='0' slot='2' function='0'/>
>>>>>   </hostdev>
>>>>>
>>>>
>>>> When user wants to assign two mdev devices to one VM, user have to add
>>>> such two entries or group the two devices in one entry?
>>>
>>> Two entries, one per UUID, each with its own PCI address in the guest.
>>>
>>>> On other mail thread with same subject we are thinking of creating group
>>>> of mdev devices to assign multiple mdev devices to one VM.
>>>
>>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
>>> other thread).
>>>
>>
>> When mdev device is created, resources from physical device is assigned
>> to this device. But resources are committed only when device goes
>> 'online' ('start' in v6 patch)
>> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
>> for all vGPU devices in a VM are committed at one place. So we need to
>> know the vGPUs assigned to a VM before QEMU starts.
>>
>> Grouping would help here as Alex suggested in that mail. Pulling only
>> that part of discussion here:
>>
>> <Alex> It seems then that the grouping needs to affect the iommu group
>> so that
>>> you know that there's only a single owner for all the mdev devices
>>> within the group.  IIRC, the bus drivers don't have any visibility
>>> to opening and releasing of the group itself to trigger the
>>> online/offline, but they can track opening of the device file
>>> descriptors within the group.  Within the VFIO API the user cannot
>>> access the device without the device file descriptor, so a "first
>>> device opened" and "last device closed" trigger would provide the
>>> trigger points you need.  Some sort of new sysfs interface would need
>>> to be invented to allow this sort of manipulation.
>>> Also we should probably keep sight of whether we feel this is
>>> sufficiently necessary for the complexity.  If we can get by with only
>>> doing this grouping at creation time then we could define the "create"
>>> interface in various ways.  For example:
>>>
>>> echo $UUID0 > create
>>>
>>> would create a single mdev named $UUID0 in it's own group.
>>>
>>> echo {$UUID0,$UUID1} > create
>>>
>>> could create mdev devices $UUID0 and $UUID1 grouped together.
>>>
>> </Alex>
>>
>> <Kirti>
>> I think this would create mdev device of same type on same parent
>> device. We need to consider the case of multiple mdev devices of
>> different types and with different parents to be grouped together.
>> </Kirti>
>>
>> <Alex> We could even do:
>>>
>>> echo $UUID1:$GROUPA > create
>>>
>>> where $GROUPA is the group ID of a previously created mdev device into
>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
>>
>> <Kirti>
>> I was thinking about:
>>
>>   echo $UUID0 > create
>>
>> would create mdev device
>>
>>   echo $UUID0 > /sys/class/mdev/create_group
>>
>> would add created device to group.
>>
>> For multiple devices case:
>>   echo $UUID0 > create
>>   echo $UUID1 > create
>>
>> would create mdev devices which could be of different types and
>> different parents.
>>   echo $UUID0, $UUID1 > /sys/class/mdev/create_group
>>
>> would add devices in a group.
>> Mdev core module would create a new group with unique number.  On mdev
>> device 'destroy' that mdev device would be removed from the group. When
>> there are no devices left in the group, group would be deleted. With
>> this "first device opened" and "last device closed" trigger can be used
>> to commit resources.
>> Then libvirt use mdev device path to pass as argument to QEMU, same as
>> it does for VFIO. Libvirt don't have to care about group number.
>> </Kirti>
>>
> 
> The more complicated one makes this, the more difficult it is for the
> customer to configure and the more difficult it is and the longer it
> takes to get something out. I didn't follow the details of groups...
> 
> What gets created from a pass through some *mdev/create_group?  

My proposal here is, on
  echo $UUID1, $UUID2 > /sys/class/mdev/create_group
would create a group in mdev core driver, which should be internal to
mdev core module. In mdev core module, a unique group number would be
saved in mdev_device structure for each device belonging to a that group.

> Does
> some new udev device get create that then is fed to the guest?

No, group is not a device. It will be like a identifier for the use of
vendor driver to identify devices in a group.

> Seems
> painful to make two distinct/async passes through systemd/udev. I
> foresee testing nightmares with creating 3 vGPU's, processing a group
> request, while some other process/thread is deleting a vGPU... How do
> the vGPU's get marked so that the delete cannot happen.
> 

How is the same case handled for direct assigned device? I mean a device
is unbound from its vendors driver, bound to vfio_pci device. How is it
guaranteed to be assigned to vfio_pci module? some other process/thread
might unbound it from vfio_pci module?

> If a vendor wants to create their own utility to group vHBA's together
> and manage that grouping, then have at it...  Doesn't seem to be
> something libvirt needs to be or should be managing...  As I go running
> for cover...
> 
> If having multiple types generated for a single vGPU, then consider the
> following XML:
> 
>    <capability type='mdev'>
>      <type id='11' [other attributes]/>
>      <type id='11' [other attributes]/>
>      <type id='12' [other attributes]/>
>      [<uuid>...</uuid>]
>     </capability>
> 
> then perhaps building the mdev_create input would be a comma separated
> list of type's to be added... "$UUID:11,11,12". Just a thought...
> 

In that case the vGPUs are created on same physical GPUs. Consider the
case two vGPUs on different physical devices need to be assigned to a
VM. Then those should be two different create commands:

   echo $UUID0 > /sys/../<bdf1>/mdev_create
   echo $UUID1 > /sys/../<bdf2>/mdev_create

Kirti.
> 
> John
> 
>> Thanks,
>> Kirti
>>
>> --
>> libvir-list mailing list
>> libvir-list@redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
>>