Cross-device and cross-driver HMM support

* Cross-device and cross-driver HMM support
@ 2024-03-27  9:52 Thomas Hellström
  2024-04-02 22:57 ` Dave Airlie
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Hellström @ 2024-03-27  9:52 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: Matthew Brost, oak.zeng, Dave Airlie, Daniel Vetter,
	Christian König

Hi!

With our SVM mirror work we'll soon start looking at HMM cross-device
support. The identified needs are

1) Instead of migrating foreign device memory to system when the
current device is faulting, leave it in place...
1a) for access using internal interconnect,
1b) for access using PCIE p2p (probably mostly as a reference)

2) Request a foreign device to migrate memory range a..b of a CPU
mm_struct to local shareable device memory on that foreign device.

and we plan to add an infrastructure for this. Probably this can be
done initially without too much (or any) changes to the hmm code
itself.

So the question is basically whether anybody is interested in a 
drm-wide solution for this and in that case also whether anybody sees
the need for cross-driver support?

Otherwise any objections against us starting out with an xe driver
helper implementation that could be lifted to drm-level when needed?

Finally any suggestions or pointers to existing solutions for this?

Any comments / suggestions greatly appreciated.

Thanks,
Thomas

* Re: Cross-device and cross-driver HMM support
  2024-03-27  9:52 Cross-device and cross-driver HMM support Thomas Hellström
@ 2024-04-02 22:57 ` Dave Airlie
  2024-04-03  9:16   ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Airlie @ 2024-04-02 22:57 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: dri-devel, intel-xe, Matthew Brost, oak.zeng, Daniel Vetter,
	Christian König

On Wed, 27 Mar 2024 at 19:52, Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> Hi!
>
> With our SVM mirror work we'll soon start looking at HMM cross-device
> support. The identified needs are
>
> 1) Instead of migrating foreign device memory to system when the
> current device is faulting, leave it in place...
> 1a) for access using internal interconnect,
> 1b) for access using PCIE p2p (probably mostly as a reference)
>
> 2) Request a foreign device to migrate memory range a..b of a CPU
> mm_struct to local shareable device memory on that foreign device.
>
> and we plan to add an infrastructure for this. Probably this can be
> done initially without too much (or any) changes to the hmm code
> itself.
>
> So the question is basically whether anybody is interested in a
> drm-wide solution for this and in that case also whether anybody sees
> the need for cross-driver support?
>
> Otherwise any objections against us starting out with an xe driver
> helper implementation that could be lifted to drm-level when needed?

I think you'd probably have a better chance of getting others to help
review it, if we started out outside the driver as much as possible.

I don't think gpuvm would have worked out as well if we'd just kept it
inside nouveau from the start, it at least forces you to think about
what should be driver specific here.

> Finally any suggestions or pointers to existing solutions for this?

I think nvidia's uvm might have some of this type of code, but no idea
how you'd even consider starting to use it as a reference,

Dave.

* Re: Cross-device and cross-driver HMM support
  2024-04-02 22:57 ` Dave Airlie
@ 2024-04-03  9:16   ` Christian König
  2024-04-03 12:57     ` Jason Gunthorpe
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2024-04-03  9:16 UTC (permalink / raw)
  To: Dave Airlie, Thomas Hellström
  Cc: dri-devel, intel-xe, Matthew Brost, oak.zeng, Daniel Vetter

Am 03.04.24 um 00:57 schrieb Dave Airlie:
> On Wed, 27 Mar 2024 at 19:52, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote:
>> Hi!
>>
>> With our SVM mirror work we'll soon start looking at HMM cross-device
>> support. The identified needs are
>>
>> 1) Instead of migrating foreign device memory to system when the
>> current device is faulting, leave it in place...
>> 1a) for access using internal interconnect,
>> 1b) for access using PCIE p2p (probably mostly as a reference)

I still agree with Sima that we won't see P2P based on HMM between 
devices anytime soon if ever.

The basic problem is that you are missing a lot of fundamental inter 
device infrastructure.

E.g. there is no common representation of DMA addresses with address 
spaces. In other words you need to know the device which does DMA for an 
address to make sense.

In addition to that, we don't have a representation for internal
connections, e.g. the common kernel has no idea that device A and device 
B can talk directly to each other, but not with device C.

>>
>> 2) Request a foreign device to migrate memory range a..b of a CPU
>> mm_struct to local shareable device memory on that foreign device.
>>
>> and we plan to add an infrastructure for this. Probably this can be
>> done initially without too much (or any) changes to the hmm code
>> itself.
>>
>> So the question is basically whether anybody is interested in a
>> drm-wide solution for this and in that case also whether anybody sees
>> the need for cross-driver support?

We have use cases for this as well, yes.

For now XGMI support is something purely AMDGPU internal, but 
essentially we would like to have that as common framework so that NICs 
and other devices could interconnect as well.

>>
>> Otherwise any objections against us starting out with an xe driver
>> helper implementation that could be lifted to drm-level when needed?
> I think you'd probably have a better chance of getting others to help
> review it, if we started out outside the driver as much as possible.

Yeah, completely agree. In particular, we need to start with
infrastructure and not some in-driver hack; we already have the latter
and it's clearly a dead end.

Regards,
Christian.

>
> I don't think gpuvm would have worked out as well if we'd just kept it
> inside nouveau from the start, it at least forces you to think about
> what should be driver specific here.
>
>> Finally any suggestions or pointers to existing solutions for this?
> I think nvidia's uvm might have some of this type of code, but no idea
> how you'd even consider starting to use it as a reference,
>
> Dave.

* Re: Cross-device and cross-driver HMM support
  2024-04-03  9:16   ` Christian König
@ 2024-04-03 12:57     ` Jason Gunthorpe
  2024-04-03 14:06       ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2024-04-03 12:57 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, Thomas Hellström, dri-devel, intel-xe,
	Matthew Brost, oak.zeng, Daniel Vetter

On Wed, Apr 03, 2024 at 11:16:36AM +0200, Christian König wrote:
> Am 03.04.24 um 00:57 schrieb Dave Airlie:
> > On Wed, 27 Mar 2024 at 19:52, Thomas Hellström
> > <thomas.hellstrom@linux.intel.com> wrote:
> > > Hi!
> > > 
> > > With our SVM mirror work we'll soon start looking at HMM cross-device
> > > support. The identified needs are
> > > 
> > > 1) Instead of migrating foreign device memory to system when the
> > > current device is faulting, leave it in place...
> > > 1a) for access using internal interconnect,
> > > 1b) for access using PCIE p2p (probably mostly as a reference)
> 
> I still agree with Sima that we won't see P2P based on HMM between devices
> anytime soon if ever.

We've got a team working on the subset of this problem where we can
have a GPU driver install DEVICE_PRIVATE pages and the RDMA driver use
hmm_range_fault() to take the DEVICE_PRIVATE and return an equivalent
P2P page for DMA.

We already have a working prototype that is not too bad code wise.

> E.g. there is no common representation of DMA addresses with address spaces.
> In other words you need to know the device which does DMA for an address to
> make sense.

? Every device calls hmm_range_fault() on its own, to populate
its own private mirror page table, and gets a P2P page. The device can
DMA map that P2P for its own use to get a topologically appropriate
DMA address for its own private page table. The struct page for P2P
references the pgmap which references the target struct device, the
DMA API provides the requesting struct device. The infrastructure for
all this is all there already.

There is a separate discussion about optimizing away the P2P pgmap,
but for the moment I'm focused on getting things working by relying on
it.

> Additional to that we don't have a representation for internal connections,
> e.g. the common kernel has no idea that device A and device B can talk
> directly to each other, but not with device C.

We do have this in the PCI P2P framework, it just isn't very complete,
but it does handle the immediate cases I see people building where we
have switches and ACS/!ACS paths with different addressing depending
on topology.
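
As a rough illustration of the two points above (the P2P page leads to
the providing device through its pgmap, the requesting device is supplied
by the caller, and the resulting address depends on topology), here is a
small userspace model. Every name in it (model_device, model_pgmap,
model_page, model_map_p2p) and the "same switch, no ACS redirect" rule are
assumptions made for this sketch; none of them are kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
struct model_device {
	const char *name;
	int switch_id;      /* PCIe switch the device sits below, -1 if none */
	bool acs_redirect;  /* true if ACS forces traffic up to the host bridge */
};

struct model_pgmap {
	const struct model_device *provider; /* device that owns the MMIO */
	uint64_t bus_base;   /* base address as seen on the PCI bus */
	uint64_t host_base;  /* base address as seen through the host bridge */
};

struct model_page {
	const struct model_pgmap *pgmap;
	uint64_t offset;     /* byte offset inside the pgmap */
};

enum model_map_type {
	MODEL_MAP_BUS_ADDR,         /* routed directly through a shared switch */
	MODEL_MAP_THRU_HOST_BRIDGE, /* routed up through the host bridge */
};

struct model_mapping {
	enum model_map_type type;
	uint64_t dma_addr;
};

/*
 * Both ends are known at mapping time: the requester is passed in by the
 * caller and the provider is reached through page->pgmap.
 */
static struct model_mapping model_map_p2p(const struct model_device *requester,
					   const struct model_page *page)
{
	const struct model_pgmap *pgmap = page->pgmap;
	const struct model_device *provider = pgmap->provider;
	struct model_mapping m;
	bool direct = requester->switch_id >= 0 &&
		      requester->switch_id == provider->switch_id &&
		      !requester->acs_redirect && !provider->acs_redirect;

	if (direct) {
		m.type = MODEL_MAP_BUS_ADDR;
		m.dma_addr = pgmap->bus_base + page->offset;
	} else {
		m.type = MODEL_MAP_THRU_HOST_BRIDGE;
		m.dma_addr = pgmap->host_base + page->offset;
	}
	return m;
}
```

The same page therefore yields different DMA addresses for different
requesters, which is why the mapping has to be done per requesting device.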

> > > and we plan to add an infrastructure for this. Probably this can be
> > > done initially without too much (or any) changes to the hmm code
> > > itself.

It is essential any work in this area is not tied to DRM.
hmm_range_fault() and DEVICE_PRIVATE are generic kernel concepts; we
need to make them work better, not build weird DRM side channels.

> > > So the question is basically whether anybody is interested in a
> > > drm-wide solution for this and in that case also whether anybody sees
> > > the need for cross-driver support?
> 
> We have use cases for this as well, yes.

Unfortunately this is a long journey. The immediate next steps are
Alistair's work to untangle the DAX refcounting mess from ZONE_DEVICE
pages:

https://lore.kernel.org/linux-mm/87ttlhmj9p.fsf@nvdebian.thelocal/

Leon is working on improving the DMA API and RDMA's ODP to
be better set up for this:

https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/

[Which is also the basis for fixing DMABUF's abuse of the DMA API]

Then it is pretty simple to teach hmm_range_fault() to convert a
DEVICE_PRIVATE page into a P2P page using a new pgmap op and from
there the rest already basically exists.
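
To make the idea of a conversion op concrete, here is a userspace model
of the control flow. The op name private_to_p2p, the model_* structures
and the MODEL_MIGRATE_TO_SYSTEM fallback are all hypothetical; the thread
does not define the actual pgmap op, so treat this as a sketch of the
shape only.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_page;

/* Hypothetical op table; the real "new pgmap op" is not defined here. */
struct model_pgmap_ops {
	/* Return a P2P page aliasing the same memory, or NULL if unsupported. */
	struct model_page *(*private_to_p2p)(struct model_page *priv);
};

struct model_pgmap {
	const struct model_pgmap_ops *ops;
	void *owner; /* plays the role of the pgmap owner */
};

struct model_page {
	struct model_pgmap *pgmap;
	bool is_device_private;
	struct model_page *p2p_alias; /* model-only: precomputed P2P alias */
};

enum model_fault_result { MODEL_USE_PAGE, MODEL_MIGRATE_TO_SYSTEM };

/* Example op implementation for the model: look up the stored alias. */
static struct model_page *model_lookup_alias(struct model_page *priv)
{
	return priv->p2p_alias;
}

/*
 * Model of handling one page for a requester identified by its owner
 * pointer. Pages the requester can use directly are returned as-is.
 * Foreign private pages are converted through the op when available;
 * otherwise the caller has to fall back to migrating to system memory.
 */
static enum model_fault_result model_fault_one(struct model_page *page,
					       void *requester_owner,
					       struct model_page **out)
{
	*out = NULL;
	if (!page->is_device_private ||
	    page->pgmap->owner == requester_owner) {
		*out = page;
		return MODEL_USE_PAGE;
	}
	if (page->pgmap->ops && page->pgmap->ops->private_to_p2p)
		*out = page->pgmap->ops->private_to_p2p(page);
	return *out ? MODEL_USE_PAGE : MODEL_MIGRATE_TO_SYSTEM;
}
```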

Folks doing non-PCIe topologies will need to teach the P2P layer how
address translation works on those buses.

Jason

* Re: Cross-device and cross-driver HMM support
  2024-04-03 12:57     ` Jason Gunthorpe
@ 2024-04-03 14:06       ` Christian König
  2024-04-03 15:09         ` Jason Gunthorpe
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2024-04-03 14:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Airlie, Thomas Hellström, dri-devel, intel-xe,
	Matthew Brost, oak.zeng, Daniel Vetter

Am 03.04.24 um 14:57 schrieb Jason Gunthorpe:
> On Wed, Apr 03, 2024 at 11:16:36AM +0200, Christian König wrote:
>> Am 03.04.24 um 00:57 schrieb Dave Airlie:
>>> On Wed, 27 Mar 2024 at 19:52, Thomas Hellström
>>> <thomas.hellstrom@linux.intel.com>  wrote:
>>>> Hi!
>>>>
>>>> With our SVM mirror work we'll soon start looking at HMM cross-device
>>>> support. The identified needs are
>>>>
>>>> 1) Instead of migrating foreign device memory to system when the
>>>> current device is faulting, leave it in place...
>>>> 1a) for access using internal interconnect,
>>>> 1b) for access using PCIE p2p (probably mostly as a reference)
>> I still agree with Sima that we won't see P2P based on HMM between devices
>> anytime soon if ever.
> We've got a team working on the subset of this problem where we can
> have a GPU driver install DEVICE_PRIVATE pages and the RDMA driver use
> hmm_range_fault() to take the DEVICE_PRIVATE and return an equivilent
> P2P page for DMA.
>
> We already have a working prototype that is not too bad code wise.

The problem with that isn't the software but the hardware.

At least on the AMD GPUs and Intel's Xe accelerators we have seen so
far, page faults are not fast enough to actually work with the semantics
the Linux kernel uses for struct pages.

That's why, for example, the SVM implementation really sucks with
fork(), the transparent huge page daemon and NUMA migrations.

Somebody should probably sit down and write a performance measurement 
tool for page faults so that we can start to compare vendors regarding this.
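
As a possible starting point for such a tool, a CPU-side baseline can be
sketched in userspace: time the first touch of freshly mapped anonymous
pages. This only measures CPU page faults, not device faults; a
vendor comparison would need a device-specific way to trigger and time
GPU faults, which is not shown. The function name cpu_first_touch_ns is
made up for this sketch, and it assumes Linux with POSIX mmap() and
clock_gettime().

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns the average time in nanoseconds per first-touch write to a
 * freshly mapped anonymous page, or -1.0 on failure.
 */
static double cpu_first_touch_ns(size_t npages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct timespec start, end;
	volatile unsigned char *buf;
	void *map;
	size_t len, i;
	double elapsed;

	if (page_size <= 0 || npages == 0)
		return -1.0;

	len = npages * (size_t)page_size;
	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1.0;

#ifdef MADV_NOHUGEPAGE
	/* Best effort: ask for small pages so each touch is its own fault. */
	madvise(map, len, MADV_NOHUGEPAGE);
#endif

	buf = map;
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < npages; i++)
		buf[i * (size_t)page_size] = 1; /* first write faults the page in */
	clock_gettime(CLOCK_MONOTONIC, &end);

	munmap(map, len);

	elapsed = (double)(end.tv_sec - start.tv_sec) * 1e9 +
		  (double)(end.tv_nsec - start.tv_nsec);
	return elapsed / (double)npages;
}
```

Calling cpu_first_touch_ns(1 << 16) a few times and comparing against the
per-fault latency seen on the device side would at least give a common
reference number.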

>> E.g. there is no common representation of DMA addresses with address spaces.
>> In other words you need to know the device which does DMA for an address to
>> make sense.
> ? Every device device calls hmm_range_fault() on it's own, to populate
> its own private mirror page table, and gets a P2P page. The device can
> DMA map that P2P for its own use to get a topologically appropriate
> DMA address for its own private page table. The struct page for P2P
> references the pgmap which references the target struct device, the
> DMA API provides the requesting struct device. The infrastructure for
> all this is all there already.

The problem is the DMA API currently has no idea of inter device 
connectors like XGMI.

So it can create P2P mappings for PCIe, but anything which isn't part of
those interconnects is ignored at the moment as far as I can see.

> There is a seperate discussion about optimizing away the P2P pgmap,
> but for the moment I'm focused on getting things working by relying on
> it.
>
>> Additional to that we don't have a representation for internal connections,
>> e.g. the common kernel has no idea that device A and device B can talk
>> directly to each other, but not with device C.
> We do have this in the PCI P2P framework, it just isn't very complete,
> but it does handle the immediate cases I see people building where we
> have switches and ACS/!ACS paths with different addressing depending
> on topology.

That's not what I meant. I'm talking about direct interconnects which
are parallel to the PCIe bus.

As far as I know we haven't even started looking into those.

>>>> and we plan to add an infrastructure for this. Probably this can be
>>>> done initially without too much (or any) changes to the hmm code
>>>> itself.
> It is essential any work in this area is not tied to DRM.
> hmm_range_fault() and DEVICE_PRIVATE are generic kernel concepts we
> need to make them work better not build weird DRM side channels.

Completely agree.

>>>> So the question is basically whether anybody is interested in a
>>>> drm-wide solution for this and in that case also whether anybody sees
>>>> the need for cross-driver support?
>> We have use cases for this as well, yes.
> Unfortunately this is a long journey. The immediate next steps are
> Alistair's work to untangle the DAX refcounting mess from ZONE_DEVICE
> pages:
>
> https://lore.kernel.org/linux-mm/87ttlhmj9p.fsf@nvdebian.thelocal/
>
> Leon is working on improving the DMA API and RDMA's ODP to
> be better setup for this:
>
> https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
>
> [Which is also the basis for fixing DMABUF's abuse of the DMA API]
>
> Then it is pretty simple to teach hmm_range_fault() to convert a
> DEVICE_PRIVATE page into a P2P page using a new pgmap op and from
> there the rest already basically exists.

Nice, that's at least one step further than I expected.

> Folks doing non-PCIe topologies will need to teach the P2P layer how
> address translation works on those buses.

Where to start with that?

Christian.

>
> Jason

* Re: Cross-device and cross-driver HMM support
  2024-04-03 14:06       ` Christian König
@ 2024-04-03 15:09         ` Jason Gunthorpe
  2024-04-09 10:18           ` Thomas Hellström
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2024-04-03 15:09 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, Thomas Hellström, dri-devel, intel-xe,
	Matthew Brost, oak.zeng, Daniel Vetter

On Wed, Apr 03, 2024 at 04:06:11PM +0200, Christian König wrote:

[UGH html emails, try to avoid those they don't get archived!]

>    The problem with that isn't the software but the hardware.
>    At least on the AMD GPUs and Intels Xe accelerators we have seen so far
>    page faults are not fast enough to actually work with the semantics the
>    Linux kernel uses for struct pages.
>    That's why for example the SVM implementation really suck with fork(),
>    the transparent huge page deamon and NUMA migrations.
>    Somebody should probably sit down and write a performance measurement
>    tool for page faults so that we can start to compare vendors regarding
>    this.

Yes, all these page fault implementations I've seen are really
slow. Even SVA/PRI is really slow. The only way it works usefully
today is for the application/userspace environment to co-operate and
avoid causing faults.

Until someone invents a faster PRI interface this is what we have. It
is limited but still useful.

>    The problem is the DMA API currently has no idea of inter device
>    connectors like XGMI.
>    So it can create P2P mappings for PCIe, but anything which isn't part
>    of those interconnects is ignore at the moment as far as I can see.

Speaking broadly - a "multi-path" device is one that has multiple DMA
initiators and thus multiple paths the DMA can travel. The different
paths may have different properties, like avoiding the iommu or what
not. This might be a private hidden bus (XGMI/nvlink/etc) in a GPU
complex or just two PCI end ports on the same chip like a socket
direct mlx5 device.

The device HW itself must have a way to select which path each DMA
goes through because the paths are going to have different address
spaces. A multi-path PCI device will have different PCI RIDs and thus
different iommu_domains/IO pagetables/IOVAs, for instance. A GPU will
alias its internal memory with the PCI IOMMU IOVA.

So, in the case of something like a GPU I expect the private PTE
itself to have bit(s) indicating if the address is PCI, local memory
or internal interconnect.
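
A minimal sketch of what such a PTE could look like, assuming a made-up
64-bit layout with a two-bit address space selector in bits 1:0 and a
4 KiB-aligned address in bits 63:12. Real GPU page table formats differ
per vendor; the names and the layout here are purely illustrative.

```c
#include <stdint.h>

/* Hypothetical address space selector stored in PTE bits 1:0. */
enum model_aspace {
	MODEL_AS_PCI = 0,          /* address goes out over PCIe */
	MODEL_AS_LOCAL = 1,        /* address is in this GPU's local memory */
	MODEL_AS_INTERCONNECT = 2, /* address is reached over the private link */
};

#define MODEL_PTE_AS_MASK   UINT64_C(0x3)
#define MODEL_PTE_ADDR_MASK (~UINT64_C(0xfff))

/* Build a PTE from a page-aligned address and an address space selector. */
static uint64_t model_pte_encode(uint64_t addr, enum model_aspace as)
{
	return (addr & MODEL_PTE_ADDR_MASK) |
	       ((uint64_t)as & MODEL_PTE_AS_MASK);
}

/* Extract the address space selector from a PTE. */
static enum model_aspace model_pte_aspace(uint64_t pte)
{
	return (enum model_aspace)(pte & MODEL_PTE_AS_MASK);
}

/* Extract the page-aligned address from a PTE. */
static uint64_t model_pte_addr(uint64_t pte)
{
	return pte & MODEL_PTE_ADDR_MASK;
}
```

The point of the model is only that the same numeric address means
different things depending on the selector, so the driver has to pick the
selector when it fills in the PTE.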

When the hmm_range_fault() encounters a DEVICE_PRIVATE page the GPU
driver must make a decision on how to set that bit.

My advice would be to organize the GPU driver so that the
"dev_private_owner" is the same value for all GPUs that share a
private address space. IOW dev_private_owner represents the physical
*address space* that the DEVICE_PRIVATE's hidden address lives in, not
the owning HW. Perhaps we will want to improve on this by adding to
the pgmap an explicit address space void * private data as well.

When set up like this, hmm_range_fault() will naturally return
DEVICE_PRIVATE pages that map to an address space for which the
requesting GPU can trivially set the PTE bit. Easy. No DMA API
fussing needed.

Otherwise hmm_range_fault() returns the CPU/P2P page. The GPU should
select the PCI path and the DMA API will check the PCI topology and
generate a correct PCI address.
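
The decision described in the last few paragraphs can be modelled in
userspace roughly as follows. The structures and the function
model_select_path are hypothetical; in particular, private_owner here
only stands in for the role dev_private_owner plays, namely identifying
the shared private address space rather than a single GPU.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_path {
	MODEL_PATH_LOCAL,        /* page lives in the requester's own memory */
	MODEL_PATH_INTERCONNECT, /* page lives on a GPU in the same group */
	MODEL_PATH_PCI,          /* CPU or P2P page, mapped through the DMA API */
};

struct model_gpu {
	const char *name;
	void *private_owner; /* same value for all GPUs sharing an address space */
};

struct model_page {
	bool is_device_private;
	const struct model_gpu *home; /* backing GPU, NULL for system memory */
};

/*
 * A private page whose owner matches the requester's owner can be
 * addressed without the DMA API: locally if the requester backs it,
 * otherwise over the internal interconnect. Everything else is treated
 * as a CPU/P2P page and takes the PCI path.
 */
static enum model_path model_select_path(const struct model_gpu *requester,
					 const struct model_page *page)
{
	if (page->is_device_private && page->home &&
	    page->home->private_owner == requester->private_owner)
		return page->home == requester ? MODEL_PATH_LOCAL
					       : MODEL_PATH_INTERCONNECT;
	return MODEL_PATH_PCI;
}
```

With two GPUs in one group and a third outside it, a page on the first
GPU resolves to the local path for that GPU, the interconnect path for
its peer, and the PCI path for the outsider.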

Whether the device driver needs/wants to create driver core buses and
devices to help it model and discover the dev_private_owner groups, I
don't know. Clearly the driver must be able to do this grouping to
make it work, and all this setup is just done when creating the pgmap.

I don't think the DMA API should become involved here. The layering in
a multi-path scenario should have the DMA API caller decide on the
path then the DMA API will map for the specific path. The caller needs
to expressly opt into this because there is additional HW - the
multi-path selector - that needs to be programmed and the DMA API
cannot make that transparent.

A similar approach works for going from P2P pages as well, the driver
can inspect the pgmap owner and similarly check the pgmap private data
to learn the address space and internal address then decide to choose
the non-PCI path.

This scales to a world without P2P struct pages because we will still
have some kind of pgmap-like structure that holds metadata for a
uniform chunk of MMIO.

Jason

* Re: Cross-device and cross-driver HMM support
  2024-04-03 15:09         ` Jason Gunthorpe
@ 2024-04-09 10:18           ` Thomas Hellström
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Hellström @ 2024-04-09 10:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Dave Airlie, dri-devel, intel-xe, Matthew Brost, oak.zeng,
	Daniel Vetter

Hi,

On Wed, 2024-04-03 at 12:09 -0300, Jason Gunthorpe wrote:
> On Wed, Apr 03, 2024 at 04:06:11PM +0200, Christian König wrote:
> 
> [UGH html emails, try to avoid those they don't get archived!]
> 
> >    The problem with that isn't the software but the hardware.
> >    At least on the AMD GPUs and Intels Xe accelerators we have seen
> > so far
> >    page faults are not fast enough to actually work with the
> > semantics the
> >    Linux kernel uses for struct pages.
> >    That's why for example the SVM implementation really suck with
> > fork(),
> >    the transparent huge page deamon and NUMA migrations.
> >    Somebody should probably sit down and write a performance
> > measurement
> >    tool for page faults so that we can start to compare vendors
> > regarding
> >    this.
> 
> Yes, all these page fault implementations I've seen are really
> slow. Even SVA/PRI is really slow. The only way it works usefully
> today is for the application/userspace environment to co-operate and
> avoid causing faults.
> 
> Until someone invents a faster PRI interface this is what we have..
> It
> is limited but still useful.
>  
> >    The problem is the DMA API currently has no idea of inter device
> >    connectors like XGMI.
> >    So it can create P2P mappings for PCIe, but anything which isn't
> > part
> >    of those interconnects is ignore at the moment as far as I can
> > see.
> 
> Speaking broadly - a "multi-path" device is one that has multiple DMA
> initiators and thus multiple paths the DMA can travel. The different
> paths may have different properties, like avoiding the iommu or what
> not. This might be a private hidden bus (XGMI/nvlink/etc) in a GPU
> complex or just two PCI end ports on the same chip like a socket
> direct mlx5 device.
> 
> The device HW itself must have a way to select which path each DMA
> goes thorugh because the paths are going to have different address
> spaces. A multi-path PCI device will have different PCI RID's and
> thus
> different iommu_domains/IO pagetables/IOVAs, for instance. A GPU will
> alias its internal memory with the PCI IOMMU IOVA.
> 
> So, in the case of something like a GPU I expect the private PTE
> itself to have bit(s) indicating if the address is PCI, local memory
> or internal interconnect.
> 
> When the hmm_range_fault() encounters a DEVICE_PRIVATE page the GPU
> driver must make a decision on how to set that bit.
> 
> My advice would be to organize the GPU driver so that the
> "dev_private_owner" is the same value for all GPU's that share a
> private address space. IOW dev_private_owner represents the physical
> *address space* that the DEVICE_PRIVATE's hidden address lives in,
> not
> the owning HW. Perhaps we will want to improve on this by adding to
> the pgmap an explicit address space void * private data as well.
> 
> When setup like this hmm_range_fault() will naturally return
> DEVICE_PRIVATE pages which map to the address space for which the
> requesting GPU can trivially set the PTE bit on. Easy. No DMA API
> fussing needed.
> 
> Otherwise hmm_range_fault() returns the CPU/P2P page. The GPU should
> select the PCI path and the DMA API will check the PCI topology and
> generate a correct PCI address.
> 
> If the device driver needs/wants to create driver core bus's and
> devices to help it model and discover the dev_private_owner groups, I
> don't know. Clearly the driver must be able to do this grouping to
> make it work, and all this setup is just done when creating the
> pgmap.
> 
> I don't think the DMA API should become involved here. The layering
> in
> a multi-path scenario should have the DMA API caller decide on the
> path then the DMA API will map for the specific path. The caller
> needs
> to expressly opt into this because there is additional HW - the
> multi-path selector - that needs to be programmed and the DMA API
> cannot make that transparent.
> 
> A similar approach works for going from P2P pages as well, the driver
> can inspect the pgmap owner and similarly check the pgmap private
> data
> to learn the address space and internal address then decide to choose
> the non-PCI path.
> 
> This scales to a world without P2P struct pages because we will still
> have some kind of 'pgmap' similar structure that holds meta data for
> a
> uniform chunk of MMIO.

Thanks everyone for suggestions and feedback. We've been discussing
something like what Jason is describing above, although I haven't had
time to digest all the details yet.

It sounds like common drm or core code is the preferred way to go
here. I also recognize that gpuvm was successful in this respect, but I
think that gpuvm also had a couple of active reviewers and multiple
drivers that were able to spend time to implement and test the code, so
let's hope for at least some active review participation and feedback
here.

Thanks,
Thomas



> 
> Jason

