From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Alex Williamson <alex.williamson@redhat.com>,
Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: kvm@vger.kernel.org, Alistair Popple <alistair@popple.id.au>,
linuxppc-dev@lists.ozlabs.org, kvm-ppc@vger.kernel.org,
Reza Arbab <arbab@linux.ibm.com>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
Date: Fri, 19 Oct 2018 11:53:53 +1100
Message-ID: <918290dc-59c3-f269-38d4-a07d323173f9@ozlabs.ru>
In-Reply-To: <20181018120502.057feb7a@w520.home>

On 19/10/2018 05:05, Alex Williamson wrote:
> On Thu, 18 Oct 2018 10:37:46 -0700
> Piotr Jaroszynski <pjaroszynski@nvidia.com> wrote:
>
>> On 10/18/18 9:55 AM, Alex Williamson wrote:
>>> On Thu, 18 Oct 2018 11:31:33 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>
>>>> On 18/10/2018 08:52, Alex Williamson wrote:
>>>>> On Wed, 17 Oct 2018 12:19:20 +1100
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>
>>>>>> On 17/10/2018 06:08, Alex Williamson wrote:
>>>>>>> On Mon, 15 Oct 2018 20:42:33 +1100
>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>> +
>>>>>>>> +	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>>>>>>>> +			pdev->device == 0x04ea) {
>>>>>>>> +		ret = vfio_pci_ibm_npu2_init(vdev);
>>>>>>>> +		if (ret) {
>>>>>>>> +			dev_warn(&vdev->pdev->dev,
>>>>>>>> +				"Failed to setup NVIDIA NV2 ATSD region\n");
>>>>>>>> +			goto disable_exit;
>>>>>>>> +		}
>>>>>>>
>>>>>>> So the NPU is also actually owned by vfio-pci and assigned to the VM?
>>>>>>
>>>>>> Yes. On a running system it looks like:
>>>>>>
>>>>>> 0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
>>>>>> 0035:00:00.0 PCI bridge: IBM Device 04c1
>>>>>> 0035:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:02:04.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:02:05.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:02:0d.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
>>>>>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>> 0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>> 0035:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>>>>>>
>>>>>> One "IBM Device" bridge represents one NVLink2, i.e. a piece of NPU.
>>>>>> They all and 3 GPUs go to the same IOMMU group and get passed through to
>>>>>> a guest.
>>>>>>
>>>>>> The entire NPU does not have representation via sysfs as a whole though.
>>>>>
>>>>> So the NPU is a bridge, but it uses a normal header type so vfio-pci
>>>>> will bind to it?
>>>>
>>>> An NPU is an NVLink bridge; it is not PCI in any sense. We (the host
>>>> powerpc firmware, known as "skiboot" or "opal") have chosen to emulate
>>>> one virtual bridge per NVLink at the firmware level, so for each physical
>>>> NPU there are 6 virtual bridges. This way the NVIDIA driver does not
>>>> need to know much about NPUs.
>>>>
>>>>> And the ATSD register that we need on it is not
>>>>> accessible through these PCI representations of the sub-pieces of the
>>>>> NPU? Thanks,
>>>>
>>>> No, only via the device tree. Skiboot puts the ATSD register address
>>>> into the 'ibm,mmio-atsd' DT property of the PHB of these virtual bridges.
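
A minimal sketch of how the host side could read that property with the
standard OF API; the helper name below is illustrative only, not the
actual patch code:

#include <linux/of.h>

/* Illustrative helper: fetch the first ATSD MMIO address that skiboot
 * exposes in the bridge's PHB device tree node. */
static int npu2_get_mmio_atsd(struct device_node *phb, u64 *addr)
{
	/* 'ibm,mmio-atsd' is a list of 64-bit MMIO addresses, one per
	 * ATSD register provided by the NPU. */
	if (of_property_read_u64_index(phb, "ibm,mmio-atsd", 0, addr))
		return -ENODEV;
	return 0;
}
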
>>>
>>> Ok, so the NPU is essentially a virtual device already, mostly just a
>>> stub. But it seems that each NPU is associated with a specific GPU; how
>>> is that association done? In the use case here it seems like it's just
>>> a vehicle to provide this ibm,mmio-atsd property to guest DT and the tgt
>>> routing information to the GPU. So if both of those were attached to
>>> the GPU, there'd be no purpose in assigning the NPU other than it's in
>>> the same IOMMU group with a type 0 header, so something needs to be
>>> done with it. If it's a virtual device, perhaps it could have a type 1
>>> header so vfio wouldn't care about it, then we would only assign the
>>> GPU with these extra properties, which seems easier for management
>>> tools and users. If the guest driver needs a visible NPU device, QEMU
>>> could possibly emulate one to make the GPU association work
>>> automatically. Maybe this isn't really a problem, but I wonder if
>>> you've looked up the management stack to see what tools need to know to
>>> assign these NPU devices and whether specific configurations are
>>> required to make the NPU to GPU association work. Thanks,
>>
>> I'm not that familiar with how this was originally set up, but note that
>> Alexey is just making it work exactly like baremetal does. The baremetal
>> GPU driver works as-is in the VM and expects the same properties in the
>> device-tree. Obviously it doesn't have to be that way, but there is
>> value in keeping it identical.
>>
>> Another, probably bigger, point is that the NPU device also implements
>> the NVLink HW interface and is required for actually training the links
>> and keeping them up. The driver in the guest trains the links by
>> programming both the GPU end and the NPU end of each link, so the NPU
>> device needs to be exposed to the guest.
>
> Ok, so there is functionality in assigning the NPU device itself; it's
> not just an attachment point for metadata. But it still seems there
> must be some association of NPU to GPU: the tgt address seems to pair
> the NPU with a specific GPU, so they're not simply a fungible set of NPUs
> and GPUs. Is that association explicit anywhere, or is it related to
> the topology or device numbering that needs to match between the host
> and guest? Thanks,

It is in the device tree (a phandle is a node ID).

NPU:
  xscom@623fc00000000/npu@5011000

NVLinks:
  xscom@623fc00000000/npu@5011000/link@0
  xscom@623fc00000000/npu@5011000/link@1
  xscom@623fc00000000/npu@5011000/link@2
  xscom@623fc00000000/npu@5011000/link@3
  xscom@623fc00000000/npu@5011000/link@5
  xscom@623fc00000000/npu@5011000/link@6

GPU RAM:
  memory@240000000000
  memory@242000000000
  memory@244000000000

GPUs:
  pciex@620c3c0500000/pci@0/pci@0/pci@4/3d-controller@0
    ibm,npu property - 2 phandles of the associated virtual bridges,
      as in my config a GPU has 2 NVLinks to the CPU (or the NPU
      in particular)
  pciex@620c3c0500000/pci@0/pci@0/pci@5/3d-controller@0
  pciex@620c3c0500000/pci@0/pci@0/pci@d/3d-controller@0

Virtual bridges:
  pciex@6230200000000/pci@0
    ibm,gpu property - a phandle of the associated GPU
    memory-region property - a phandle of a GPU RAM block
    ibm,nvlink property - a phandle of an NVLink
    ibm,device-tgt-addr property - the short physical address of the
      GPU RAM (0x00000c00.00000000 in this example)
  pciex@6230200000000/pci@0,1
  pciex@6230200000000/pci@1
  pciex@6230200000000/pci@1,1
  pciex@6230200000000/pci@2
  pciex@6230200000000/pci@2,1
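
For illustration only, these phandle links can be resolved with the
standard OF API; a minimal sketch (the function name and printout are
made up for this example, this is not code from the patch):

#include <linux/of.h>
#include <linux/printk.h>

/* Illustrative only: resolve the associations listed above from a
 * virtual bridge node. */
static void npu2_dump_bridge_links(struct device_node *bridge)
{
	struct device_node *gpu, *mem, *nvlink;
	u64 tgt = 0;

	gpu = of_parse_phandle(bridge, "ibm,gpu", 0);        /* associated GPU */
	mem = of_parse_phandle(bridge, "memory-region", 0);  /* GPU RAM block */
	nvlink = of_parse_phandle(bridge, "ibm,nvlink", 0);  /* physical NVLink */
	of_property_read_u64(bridge, "ibm,device-tgt-addr", &tgt);

	pr_info("bridge %pOF: gpu=%pOF mem=%pOF link=%pOF tgt=%llx\n",
		bridge, gpu, mem, nvlink, (unsigned long long)tgt);

	of_node_put(gpu);
	of_node_put(mem);
	of_node_put(nvlink);
}
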
--
Alexey