Design session notes: GPU acceleration in Xen


* Design session notes: GPU acceleration in Xen
@ 2024-06-13 18:43 Demi Marie Obenour
  2024-06-14  6:38 ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-13 18:43 UTC (permalink / raw)
  To: Xen developer discussion
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Andrew Cooper,
	Ray Huang

GPU acceleration requires that pageable host memory be able to be mapped
into a guest.  This requires changes to all of the Xen hypervisor, Linux
kernel, and userspace device model.

### Goals

 - Allow any userspace pages to be mapped into a guest.
 - Support deprivileged operation: this API must not be usable for privilege escalation.
 - Use MMU notifiers to ensure safety with respect to use-after-free.

### Hypervisor changes

There are at least two Xen changes required:

1. Add a new flag to IOREQ that means "retry this instruction".

   An IOREQ server can set this flag after having successfully handled a
   page fault.  It is expected that the IOREQ server has successfully
   mapped a page into the guest at the location of the fault.
   Otherwise, the same fault will likely happen again.

2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
   just IOMEM.  Mappings made with `XEN_DOMCTL_memory_mapping` are
   guaranteed to be able to be successfully revoked with
   `XEN_DOMCTL_memory_mapping`, so all operations that would create
   extra references to the mapped memory must be forbidden.  These
   include, but may not be limited to:

   1. Granting the pages to the same or other domains.
   2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
   3. Another domain accessing the pages using the foreign memory APIs,
      unless it is privileged over the domain that owns the pages.

   Open question: what if the other domain goes away?  Ideally,
   unmapping would (vacuously) succeed in this case.  Qubes OS doesn't
   care about domid reuse but others might.
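
The exclusivity rule in item 2 can be illustrated with a toy model.  This
is only a sketch under stated assumptions: `struct page_state`,
`map_revocable()`, `may_grant()`, `may_foreign_map()`, and `revoke()` are
hypothetical names that do not exist in Xen, and real reference tracking
is considerably more involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page bookkeeping; not a real Xen structure. */
struct page_state {
    uint16_t owner;        /* domain that owns the page */
    uint16_t mapped_into;  /* domain the page is revocably mapped into */
    bool     revocable;    /* true while a revocable mapping exists */
};

/* Restriction 2: at most one revocable mapping per page. */
bool map_revocable(struct page_state *pg, uint16_t target)
{
    if (pg->revocable)
        return false;
    pg->revocable = true;
    pg->mapped_into = target;
    return true;
}

/* Restriction 1: revocably mapped pages must not be granted. */
bool may_grant(const struct page_state *pg)
{
    return !pg->revocable;
}

/* Restriction 3: foreign access only if privileged over the owner. */
bool may_foreign_map(const struct page_state *pg,
                     bool caller_privileged_over_owner)
{
    return !pg->revocable || caller_privileged_over_owner;
}

/* With no extra references possible, revocation cannot fail. */
void revoke(struct page_state *pg)
{
    pg->revocable = false;
}
```

The point of the model is that `revoke()` has no failure path: every
operation that could have created a second reference was refused earlier.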

### Kernel changes

Linux will add support for mapping userspace memory into an emulated PCI
BAR.  This requires Linux to automatically revoke access when needed.

There will be an IOREQ server that handles page faults.  The discussion
assumed that this handling will happen in kernel mode, but if handling
in user mode is simpler that is also an option.

There is no async #PF in Xen (yet), so the entire vCPU will be blocked
while the fault is handled.  This is not great for performance, but
correctness comes first.

There will be a new kernel ioctl to perform the mapping.  A possible C
prototype (presented at design session, but not discussed there):

    struct xen_linux_register_memory {
        uint64_t pointer;
        uint64_t size;
        uint64_t gpa;
        uint32_t id;
        uint32_t guest_domid;
    };
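
A caller or the kernel might sanity-check such a request before acting on
it.  The sketch below is hypothetical: the field meanings (userspace
address, length in bytes, guest physical address), the 4 KiB page size,
and the helper name `xen_register_memory_valid()` are assumptions, since
the prototype was not discussed in the session.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

struct xen_linux_register_memory {
    uint64_t pointer;      /* assumed: userspace virtual address */
    uint64_t size;         /* assumed: length in bytes */
    uint64_t gpa;          /* assumed: guest physical address */
    uint32_t id;
    uint32_t guest_domid;
};

/* Three 64-bit and two 32-bit fields leave no implicit padding. */
_Static_assert(sizeof(struct xen_linux_register_memory) == 32,
               "unexpected padding in ioctl argument");

/* Reject requests that are empty, not page aligned, or wrap around. */
bool xen_register_memory_valid(const struct xen_linux_register_memory *req)
{
    const uint64_t mask = SKETCH_PAGE_SIZE - 1;

    if (req->size == 0)
        return false;
    if ((req->pointer | req->size | req->gpa) & mask)
        return false;
    if (req->pointer + req->size < req->pointer)
        return false;
    if (req->gpa + req->size < req->gpa)
        return false;
    return true;
}
```

Ordering the 64-bit fields first keeps the layout identical for 32-bit
and 64-bit callers, which avoids a compat ioctl path.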
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Design session notes: GPU acceleration in Xen
  2024-06-13 18:43 Design session notes: GPU acceleration in Xen Demi Marie Obenour
@ 2024-06-14  6:38 ` Jan Beulich
  2024-06-14  7:21   ` Roger Pau Monné
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Jan Beulich @ 2024-06-14  6:38 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper

On 13.06.2024 20:43, Demi Marie Obenour wrote:
> GPU acceleration requires that pageable host memory be able to be mapped
> into a guest.

I'm sure it was explained in the session, which sadly I couldn't attend.
I've been asking Ray and Xenia the same before, but I'm afraid it still
hasn't become clear to me why this is a _requirement_. After all that's
against what we're doing elsewhere (i.e. so far it has always been
guest memory that's mapped in the host). I can appreciate that it might
be more difficult to implement, but not violating this fundamental
(kind of) rule might be worth the price (and would avoid other
complexities, of which there may be lurking more than what you enumerate
below).

>  This requires changes to all of the Xen hypervisor, Linux
> kernel, and userspace device model.
> 
> ### Goals
> 
>  - Allow any userspace pages to be mapped into a guest.
>  - Support deprivileged operation: this API must not be usable for privilege escalation.
>  - Use MMU notifiers to ensure safety with respect to use-after-free.
> 
> ### Hypervisor changes
> 
> There are at least two Xen changes required:
> 
> 1. Add a new flag to IOREQ that means "retry this instruction".
> 
>    An IOREQ server can set this flag after having successfully handled a
>    page fault.  It is expected that the IOREQ server has successfully
>    mapped a page into the guest at the location of the fault.
>    Otherwise, the same fault will likely happen again.

Were there any thoughts on how to prevent this becoming an infinite loop?
I.e. how to (a) guarantee forward progress in the guest and (b) deal with
misbehaving IOREQ servers?

> 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
>    just IOMEM.  Mappings made with `XEN_DOMCTL_memory_mapping` are
>    guaranteed to be able to be successfully revoked with
>    `XEN_DOMCTL_memory_mapping`, so all operations that would create
>    extra references to the mapped memory must be forbidden.  These
>    include, but may not be limited to:
> 
>    1. Granting the pages to the same or other domains.
>    2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
>    3. Another domain accessing the pages using the foreign memory APIs,
>       unless it is privileged over the domain that owns the pages.

All of which may call for actually converting the memory to kind-of-MMIO,
with a means to later convert it back.

Jan

>    Open question: what if the other domain goes away?  Ideally,
>    unmapping would (vacuously) succeed in this case.  Qubes OS doesn't
>    care about domid reuse but others might.
> 
> ### Kernel changes
> 
> Linux will add support for mapping userspace memory into an emulated PCI
> BAR.  This requires Linux to automatically revoke access when needed.
> 
> There will be an IOREQ server that handles page faults.  The discussion
> assumed that this handling will happen in kernel mode, but if handling
> in user mode is simpler that is also an option.
> 
> There is no async #PF in Xen (yet), so the entire vCPU will be blocked
> while the fault is handled.  This is not great for performance, but
> correctness comes first.
> 
> There will be a new kernel ioctl to perform the mapping.  A possible C
> prototype (presented at design session, but not discussed there):
> 
>     struct xen_linux_register_memory {
>         uint64_t pointer;
>         uint64_t size;
>         uint64_t gpa;
>         uint32_t id;
>         uint32_t guest_domid;
>     };




* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  6:38 ` Jan Beulich
@ 2024-06-14  7:21   ` Roger Pau Monné
  2024-06-14  8:12     ` Jan Beulich
  2024-06-14 17:55     ` Demi Marie Obenour
  2024-06-14 16:35   ` Demi Marie Obenour
  2024-06-14 20:56   ` Demi Marie Obenour
  2 siblings, 2 replies; 26+ messages in thread
From: Roger Pau Monné @ 2024-06-14  7:21 UTC (permalink / raw)
  To: Jan Beulich, Demi Marie Obenour
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper

On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > GPU acceleration requires that pageable host memory be able to be mapped
> > into a guest.
> 
> I'm sure it was explained in the session, which sadly I couldn't attend.
> I've been asking Ray and Xenia the same before, but I'm afraid it still
> hasn't become clear to me why this is a _requirement_. After all that's
> against what we're doing elsewhere (i.e. so far it has always been
> guest memory that's mapped in the host). I can appreciate that it might
> be more difficult to implement, but avoiding to violate this fundamental
> (kind of) rule might be worth the price (and would avoid other
> complexities, of which there may be lurking more than what you enumerate
> below).

My limited understanding (please someone correct me if wrong) is that
the GPU buffer (or context I think it's also called?) is always
allocated from dom0 (the owner of the GPU).  The underlying memory
addresses of such a buffer need to be mapped into the guest.  The
buffer's backing memory might be GPU MMIO from the device BAR(s) or
system RAM, and such a buffer can be paged by the dom0 kernel at any
time (in other words, the backing memory can change from MMIO to RAM
or vice versa).  Also, the buffer must be contiguous in physical
address space.

I'm not sure it's possible to ensure that when using system RAM such
memory comes from the guest rather than the host, as it would likely
require some very intrusive hooks into the kernel logic, and
negotiation with the guest to allocate the requested amount of
memory and hand it over to dom0.  If the maximum size of the buffer is
known in advance maybe dom0 can negotiate with the guest to allocate
such a region and grant dom0 access to it at driver attachment time.

One aspect where I lack clarity is how the process of allocating and
assigning a GPU buffer to a guest is performed (I think this is the
key to how GPU VirtIO native contexts work?).

Another question I have: are guests expected to have a single GPU
buffer, or can they have multiple GPU buffers allocated
simultaneously?

Regards, Roger.


* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  7:21   ` Roger Pau Monné
@ 2024-06-14  8:12     ` Jan Beulich
  2024-06-14  8:39       ` Roger Pau Monné
  2024-06-14 16:44       ` Demi Marie Obenour
  2024-06-14 17:55     ` Demi Marie Obenour
  1 sibling, 2 replies; 26+ messages in thread
From: Jan Beulich @ 2024-06-14  8:12 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Demi Marie Obenour

On 14.06.2024 09:21, Roger Pau Monné wrote:
> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
>> On 13.06.2024 20:43, Demi Marie Obenour wrote:
>>> GPU acceleration requires that pageable host memory be able to be mapped
>>> into a guest.
>>
>> I'm sure it was explained in the session, which sadly I couldn't attend.
>> I've been asking Ray and Xenia the same before, but I'm afraid it still
>> hasn't become clear to me why this is a _requirement_. After all that's
>> against what we're doing elsewhere (i.e. so far it has always been
>> guest memory that's mapped in the host). I can appreciate that it might
>> be more difficult to implement, but avoiding to violate this fundamental
>> (kind of) rule might be worth the price (and would avoid other
>> complexities, of which there may be lurking more than what you enumerate
>> below).
> 
> My limited understanding (please someone correct me if wrong) is that
> the GPU buffer (or context I think it's also called?) is always
> allocated from dom0 (the owner of the GPU).  The underling memory
> addresses of such buffer needs to be mapped into the guest.  The
> buffer backing memory might be GPU MMIO from the device BAR(s) or
> system RAM, and such buffer can be paged by the dom0 kernel at any
> time (iow: changing the backing memory from MMIO to RAM or vice
> versa).  Also, the buffer must be contiguous in physical address
> space.

This last one in particular would of course be a severe restriction.
Yet: There's an IOMMU involved, isn't there?

> I'm not sure it's possible to ensure that when using system RAM such
> memory comes from the guest rather than the host, as it would likely
> require some very intrusive hooks into the kernel logic, and
> negotiation with the guest to allocate the requested amount of
> memory and hand it over to dom0.  If the maximum size of the buffer is
> known in advance maybe dom0 can negotiate with the guest to allocate
> such a region and grant it access to dom0 at driver attachment time.

Besides the thought of transiently converting RAM to kind-of-MMIO, this
makes me think of another possible option: Could Dom0 transfer ownership
of the RAM that wants mapping in the guest (remotely resembling
grant-transfer)? Would require the guest to have ballooned down enough
first, of course. (In both cases it would certainly need working out how
the conversion / transfer back could be made to work safely and reasonably
cleanly.)

Jan

> One aspect that I'm lacking clarity is better understanding of how the
> process of allocating and assigning a GPU buffer to a guest is
> performed (I think this is the key to how GPU VirtIO native contexts
> work?).
> 
> Another question I have, are guest expected to have a single GPU
> buffer, or they can have multiple GPU buffers simultaneously
> allocated?
> 
> Regards, Roger.




* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  8:12     ` Jan Beulich
@ 2024-06-14  8:39       ` Roger Pau Monné
  2024-06-17  0:38         ` Demi Marie Obenour
  2024-06-14 16:44       ` Demi Marie Obenour
  1 sibling, 1 reply; 26+ messages in thread
From: Roger Pau Monné @ 2024-06-14  8:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Demi Marie Obenour

On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> On 14.06.2024 09:21, Roger Pau Monné wrote:
> > On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> >> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> >>> GPU acceleration requires that pageable host memory be able to be mapped
> >>> into a guest.
> >>
> >> I'm sure it was explained in the session, which sadly I couldn't attend.
> >> I've been asking Ray and Xenia the same before, but I'm afraid it still
> >> hasn't become clear to me why this is a _requirement_. After all that's
> >> against what we're doing elsewhere (i.e. so far it has always been
> >> guest memory that's mapped in the host). I can appreciate that it might
> >> be more difficult to implement, but avoiding to violate this fundamental
> >> (kind of) rule might be worth the price (and would avoid other
> >> complexities, of which there may be lurking more than what you enumerate
> >> below).
> > 
> > My limited understanding (please someone correct me if wrong) is that
> > the GPU buffer (or context I think it's also called?) is always
> > allocated from dom0 (the owner of the GPU).  The underling memory
> > addresses of such buffer needs to be mapped into the guest.  The
> > buffer backing memory might be GPU MMIO from the device BAR(s) or
> > system RAM, and such buffer can be paged by the dom0 kernel at any
> > time (iow: changing the backing memory from MMIO to RAM or vice
> > versa).  Also, the buffer must be contiguous in physical address
> > space.
> 
> This last one in particular would of course be a severe restriction.
> Yet: There's an IOMMU involved, isn't there?

Yup, IIRC that's why Ray said it was much easier for them to support
VirtIO GPUs from a PVH dom0 rather than a classic PV one.

It might be easier to implement from a classic PV dom0 if there's
pv-iommu support, so that dom0 can create its own contiguous memory
buffers from the device PoV.

> > I'm not sure it's possible to ensure that when using system RAM such
> > memory comes from the guest rather than the host, as it would likely
> > require some very intrusive hooks into the kernel logic, and
> > negotiation with the guest to allocate the requested amount of
> > memory and hand it over to dom0.  If the maximum size of the buffer is
> > known in advance maybe dom0 can negotiate with the guest to allocate
> > such a region and grant it access to dom0 at driver attachment time.
> 
> Besides the thought of transiently converting RAM to kind-of-MMIO, this

As a note here, changing the type to MMIO would likely involve
modifying the EPT/NPT tables to propagate the new type.  On a PVH dom0
this would likely involve shattering superpages in order to set the
correct memory types.

Depending on how often and how randomly those system RAM changes are
needed, this could also create contention on the p2m lock.
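
As a rough illustration of the superpage cost: with 4 KiB base pages and
2 MiB superpages, shattering one superpage produces 512 small entries.
The sketch below counts how many 2 MiB superpages a retyped range only
partially covers, and which would therefore need shattering.  The
function name, and the assumption that a fully covered superpage can be
retyped without splitting, are illustrative only and do not describe
Xen's p2m code.

```c
#include <stdbool.h>
#include <stdint.h>

#define BASE_PAGE_SIZE        (1ull << 12)   /* 4 KiB */
#define SUPERPAGE_SIZE        (1ull << 21)   /* 2 MiB */
#define ENTRIES_PER_SUPERPAGE (SUPERPAGE_SIZE / BASE_PAGE_SIZE)

/*
 * Number of 2 MiB superpages that [start, start + len) covers only
 * partially.  start and len are byte values, assumed 4 KiB aligned.
 */
uint64_t partially_covered_superpages(uint64_t start, uint64_t len)
{
    if (len == 0)
        return 0;

    uint64_t end   = start + len;                 /* exclusive */
    uint64_t first = start / SUPERPAGE_SIZE;
    uint64_t last  = (end - 1) / SUPERPAGE_SIZE;
    bool head = (start % SUPERPAGE_SIZE) != 0;    /* unaligned start */
    bool tail = (end % SUPERPAGE_SIZE) != 0;      /* unaligned end */

    if (first == last)
        return (head || tail) ? 1 : 0;
    return (head ? 1 : 0) + (tail ? 1 : 0);
}
```

Under these assumptions a single-page change inside an otherwise intact
superpage is the worst case per byte retyped: one page changes type and
512 entries replace one.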

> makes me think of another possible option: Could Dom0 transfer ownership
> of the RAM that wants mapping in the guest (remotely resembling
> grant-transfer)? Would require the guest to have ballooned down enough
> first, of course. (In both cases it would certainly need working out how
> the conversion / transfer back could be made work safely and reasonably
> cleanly.)

Maybe.  The fact the guest needs to balloon down that amount of memory
seems weird to me, as from the guest PoV that mapped memory is
MMIO-like and not system RAM.

Thanks, Roger.



* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  8:39       ` Roger Pau Monné
@ 2024-06-17  0:38         ` Demi Marie Obenour
  2024-06-17  7:46           ` Roger Pau Monné
  2024-06-17  9:13           ` Jan Beulich
  0 siblings, 2 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-17  0:38 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper

On Fri, Jun 14, 2024 at 10:39:37AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> > On 14.06.2024 09:21, Roger Pau Monné wrote:
> > > On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> > >> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > >>> GPU acceleration requires that pageable host memory be able to be mapped
> > >>> into a guest.
> > >>
> > >> I'm sure it was explained in the session, which sadly I couldn't attend.
> > >> I've been asking Ray and Xenia the same before, but I'm afraid it still
> > >> hasn't become clear to me why this is a _requirement_. After all that's
> > >> against what we're doing elsewhere (i.e. so far it has always been
> > >> guest memory that's mapped in the host). I can appreciate that it might
> > >> be more difficult to implement, but avoiding to violate this fundamental
> > >> (kind of) rule might be worth the price (and would avoid other
> > >> complexities, of which there may be lurking more than what you enumerate
> > >> below).
> > > 
> > > My limited understanding (please someone correct me if wrong) is that
> > > the GPU buffer (or context I think it's also called?) is always
> > > allocated from dom0 (the owner of the GPU).  The underling memory
> > > addresses of such buffer needs to be mapped into the guest.  The
> > > buffer backing memory might be GPU MMIO from the device BAR(s) or
> > > system RAM, and such buffer can be paged by the dom0 kernel at any
> > > time (iow: changing the backing memory from MMIO to RAM or vice
> > > versa).  Also, the buffer must be contiguous in physical address
> > > space.
> > 
> > This last one in particular would of course be a severe restriction.
> > Yet: There's an IOMMU involved, isn't there?
> 
> Yup, IIRC that's why Ray said it was much more easier for them to
> support VirtIO GPUs from a PVH dom0 rather than classic PV one.
> 
> It might be easier to implement from a classic PV dom0 if there's
> pv-iommu support, so that dom0 can create it's own contiguous memory
> buffers from the device PoV.

What makes PVH an improvement here?  I thought PV dom0 uses an identity
mapping for the IOMMU, while a PVH dom0 uses an IOMMU that mirrors the
dom0 second-stage page tables.  In both cases, the device physical
addresses are identical to dom0’s physical addresses.

PV is terrible for many reasons, so I’m okay with focusing on PVH dom0,
but I’d like to know why there is a difference.

> > > I'm not sure it's possible to ensure that when using system RAM such
> > > memory comes from the guest rather than the host, as it would likely
> > > require some very intrusive hooks into the kernel logic, and
> > > negotiation with the guest to allocate the requested amount of
> > > memory and hand it over to dom0.  If the maximum size of the buffer is
> > > known in advance maybe dom0 can negotiate with the guest to allocate
> > > such a region and grant it access to dom0 at driver attachment time.
> > 
> > Besides the thought of transiently converting RAM to kind-of-MMIO, this
> 
> As a note here, changing the type to MMIO would likely involve
> modifying the EPT/NPT tables to propagate the new type.  On a PVH dom0
> this would likely involve shattering superpages in order to set the
> correct memory types.
> 
> Depending on how often and how random those system RAM changes are
> necessary this could also create contention on the p2m lock.
> 
> > makes me think of another possible option: Could Dom0 transfer ownership
> > of the RAM that wants mapping in the guest (remotely resembling
> > grant-transfer)? Would require the guest to have ballooned down enough
> > first, of course. (In both cases it would certainly need working out how
> > the conversion / transfer back could be made work safely and reasonably
> > cleanly.)
> 
> Maybe.  The fact the guest needs to balloon down that amount of memory
> seems weird to me, as from the guest PoV that mapped memory is
> MMIO-like and not system RAM.

I don’t like it either.  Furthermore, this would require changes to the
virtio-GPU driver in the guest, which I’d prefer to avoid.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Design session notes: GPU acceleration in Xen
  2024-06-17  0:38         ` Demi Marie Obenour
@ 2024-06-17  7:46           ` Roger Pau Monné
  2024-06-17 15:07             ` Demi Marie Obenour
  2024-06-17 20:46             ` Marek Marczykowski-Górecki
  2024-06-17  9:13           ` Jan Beulich
  1 sibling, 2 replies; 26+ messages in thread
From: Roger Pau Monné @ 2024-06-17  7:46 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jan Beulich, Xenia Ragiadakou, Marek Marczykowski-Górecki,
	Ray Huang, Xen developer discussion, Andrew Cooper

On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> On Fri, Jun 14, 2024 at 10:39:37AM +0200, Roger Pau Monné wrote:
> > On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> > > On 14.06.2024 09:21, Roger Pau Monné wrote:
> > > > On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> > > >> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > > >>> GPU acceleration requires that pageable host memory be able to be mapped
> > > >>> into a guest.
> > > >>
> > > >> I'm sure it was explained in the session, which sadly I couldn't attend.
> > > >> I've been asking Ray and Xenia the same before, but I'm afraid it still
> > > >> hasn't become clear to me why this is a _requirement_. After all that's
> > > >> against what we're doing elsewhere (i.e. so far it has always been
> > > >> guest memory that's mapped in the host). I can appreciate that it might
> > > >> be more difficult to implement, but avoiding to violate this fundamental
> > > >> (kind of) rule might be worth the price (and would avoid other
> > > >> complexities, of which there may be lurking more than what you enumerate
> > > >> below).
> > > > 
> > > > My limited understanding (please someone correct me if wrong) is that
> > > > the GPU buffer (or context I think it's also called?) is always
> > > > allocated from dom0 (the owner of the GPU).  The underling memory
> > > > addresses of such buffer needs to be mapped into the guest.  The
> > > > buffer backing memory might be GPU MMIO from the device BAR(s) or
> > > > system RAM, and such buffer can be paged by the dom0 kernel at any
> > > > time (iow: changing the backing memory from MMIO to RAM or vice
> > > > versa).  Also, the buffer must be contiguous in physical address
> > > > space.
> > > 
> > > This last one in particular would of course be a severe restriction.
> > > Yet: There's an IOMMU involved, isn't there?
> > 
> > Yup, IIRC that's why Ray said it was much more easier for them to
> > support VirtIO GPUs from a PVH dom0 rather than classic PV one.
> > 
> > It might be easier to implement from a classic PV dom0 if there's
> > pv-iommu support, so that dom0 can create it's own contiguous memory
> > buffers from the device PoV.
> 
> What makes PVH an improvement here?  I thought PV dom0 uses an identity
> mapping for the IOMMU, while a PVH dom0 uses an IOMMU that mirrors the
> dom0 second-stage page tables.

Indeed, hence finding a physically contiguous buffer on classic PV is
way more complicated, because the IOMMU identity maps mfns, and the PV
address space can be completely scattered.

OTOH, on PVH the IOMMU page tables are the same as the second stage
translation, and hence the physical address space is far more compact
(as it would be on native).
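
A toy model of the difference, as a sketch only: `device_frame()` and
`device_contiguous()` are hypothetical helpers, and the `mfns` array
stands in for whatever machine frames happen to back a buffer that is
contiguous in dom0's own physical address space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Frame number the device must use for page i of a buffer.
 * PV:  the IOMMU identity-maps mfns, so the device uses the mfn.
 * PVH: the IOMMU mirrors the second-stage tables, so the device
 *      uses the dom0 physical frame (first_pfn + i).
 */
uint64_t device_frame(const uint64_t *mfns, uint64_t first_pfn,
                      size_t i, bool iommu_mirrors_p2m)
{
    return iommu_mirrors_p2m ? first_pfn + i : mfns[i];
}

/* True if the buffer is contiguous from the device's point of view. */
bool device_contiguous(const uint64_t *mfns, uint64_t first_pfn,
                       size_t n, bool iommu_mirrors_p2m)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t prev = device_frame(mfns, first_pfn, i - 1,
                                     iommu_mirrors_p2m);
        uint64_t cur  = device_frame(mfns, first_pfn, i,
                                     iommu_mirrors_p2m);
        if (cur != prev + 1)
            return false;
    }
    return true;
}
```

With scattered backing frames the same buffer is contiguous for the
device only in the mirrored (PVH) case.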

> In both cases, the device physical
> addresses are identical to dom0’s physical addresses.

Yes, but a PV dom0 physical address space can be very scattered.

IIRC there's a hypercall to request physically contiguous memory for
PV, but you don't want to be using that every time you allocate a
buffer (not sure it would support the sizes needed by the GPU
anyway).

> PV is terrible for many reasons, so I’m okay with focusing on PVH dom0,
> but I’d like to know why there is a difference.
> 
> > > > I'm not sure it's possible to ensure that when using system RAM such
> > > > memory comes from the guest rather than the host, as it would likely
> > > > require some very intrusive hooks into the kernel logic, and
> > > > negotiation with the guest to allocate the requested amount of
> > > > memory and hand it over to dom0.  If the maximum size of the buffer is
> > > > known in advance maybe dom0 can negotiate with the guest to allocate
> > > > such a region and grant it access to dom0 at driver attachment time.
> > > 
> > > Besides the thought of transiently converting RAM to kind-of-MMIO, this
> > 
> > As a note here, changing the type to MMIO would likely involve
> > modifying the EPT/NPT tables to propagate the new type.  On a PVH dom0
> > this would likely involve shattering superpages in order to set the
> > correct memory types.
> > 
> > Depending on how often and how random those system RAM changes are
> > necessary this could also create contention on the p2m lock.
> > 
> > > makes me think of another possible option: Could Dom0 transfer ownership
> > > of the RAM that wants mapping in the guest (remotely resembling
> > > grant-transfer)? Would require the guest to have ballooned down enough
> > > first, of course. (In both cases it would certainly need working out how
> > > the conversion / transfer back could be made work safely and reasonably
> > > cleanly.)
> > 
> > Maybe.  The fact the guest needs to balloon down that amount of memory
> > seems weird to me, as from the guest PoV that mapped memory is
> > MMIO-like and not system RAM.
> 
> I don’t like it either.  Furthermore, this would require changes to the
> virtio-GPU driver in the guest, which I’d prefer to avoid.

IMO it would be helpful if you (or someone) could write the full
specification of how VirtIO GPU is supposed to work right now (with
the KVM model I assume?) as it would be a good starting point to
provide suggestions about how to make it work (or adapt it) on Xen.

I don't think the high level layers on top of VirtIO GPU are relevant,
but it's important to understand the protocol between the VirtIO GPU
front and back ends.

So far I have only had scattered conversations about what's needed,
but not a formal write-up of how this is supposed to work.

Thanks, Roger.



* Re: Design session notes: GPU acceleration in Xen
  2024-06-17  7:46           ` Roger Pau Monné
@ 2024-06-17 15:07             ` Demi Marie Obenour
  2024-06-17 20:46             ` Marek Marczykowski-Górecki
  1 sibling, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-17 15:07 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Xenia Ragiadakou, Marek Marczykowski-Górecki,
	Ray Huang, Xen developer discussion, Andrew Cooper

On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > On Fri, Jun 14, 2024 at 10:39:37AM +0200, Roger Pau Monné wrote:
> > > On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> > > > On 14.06.2024 09:21, Roger Pau Monné wrote:
> > > > > On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> > > > >> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > > > >>> GPU acceleration requires that pageable host memory be able to be mapped
> > > > >>> into a guest.
> > > > >>
> > > > >> I'm sure it was explained in the session, which sadly I couldn't attend.
> > > > >> I've been asking Ray and Xenia the same before, but I'm afraid it still
> > > > >> hasn't become clear to me why this is a _requirement_. After all that's
> > > > >> against what we're doing elsewhere (i.e. so far it has always been
> > > > >> guest memory that's mapped in the host). I can appreciate that it might
> > > > >> be more difficult to implement, but avoiding to violate this fundamental
> > > > >> (kind of) rule might be worth the price (and would avoid other
> > > > >> complexities, of which there may be lurking more than what you enumerate
> > > > >> below).
> > > > > 
> > > > > My limited understanding (please someone correct me if wrong) is that
> > > > > the GPU buffer (or context I think it's also called?) is always
> > > > > allocated from dom0 (the owner of the GPU).  The underling memory
> > > > > addresses of such buffer needs to be mapped into the guest.  The
> > > > > buffer backing memory might be GPU MMIO from the device BAR(s) or
> > > > > system RAM, and such buffer can be paged by the dom0 kernel at any
> > > > > time (iow: changing the backing memory from MMIO to RAM or vice
> > > > > versa).  Also, the buffer must be contiguous in physical address
> > > > > space.
> > > > 
> > > > This last one in particular would of course be a severe restriction.
> > > > Yet: There's an IOMMU involved, isn't there?
> > > 
> > > Yup, IIRC that's why Ray said it was much more easier for them to
> > > support VirtIO GPUs from a PVH dom0 rather than classic PV one.
> > > 
> > > It might be easier to implement from a classic PV dom0 if there's
> > > pv-iommu support, so that dom0 can create it's own contiguous memory
> > > buffers from the device PoV.
> > 
> > What makes PVH an improvement here?  I thought PV dom0 uses an identity
> > mapping for the IOMMU, while a PVH dom0 uses an IOMMU that mirrors the
> > dom0 second-stage page tables.
> 
> Indeed, hence finding a physically contiguous buffer on classic PV is
> way more complicated, because the IOMMU identity maps mfns, and the PV
> address space can be completely scattered.
> 
> OTOH, on PVH the IOMMU page tables are the same as the second stage
> translation, and hence the physical address is way more compact (as it
> would be on native).

Ah, _that_ is what I missed.  I didn't realize that the physical address
space of PV guests was so scattered.

> > In both cases, the device physical
> > addresses are identical to dom0’s physical addresses.
> 
> Yes, but a PV dom0 physical address space can be very scattered.
> 
> IIRC there's a hypercall to request physically contiguous memory for
> PV, but you don't want to be using that every time you allocate a
> buffer (not sure it would support the sizes needed by the GPU
> anyway).

That makes sense, thanks!

> > PV is terrible for many reasons, so I’m okay with focusing on PVH dom0,
> > but I’d like to know why there is a difference.
> > 
> > > > > I'm not sure it's possible to ensure that when using system RAM such
> > > > > memory comes from the guest rather than the host, as it would likely
> > > > > require some very intrusive hooks into the kernel logic, and
> > > > > negotiation with the guest to allocate the requested amount of
> > > > > memory and hand it over to dom0.  If the maximum size of the buffer is
> > > > > known in advance maybe dom0 can negotiate with the guest to allocate
> > > > > such a region and grant it access to dom0 at driver attachment time.
> > > > 
> > > > Besides the thought of transiently converting RAM to kind-of-MMIO, this
> > > 
> > > As a note here, changing the type to MMIO would likely involve
> > > modifying the EPT/NPT tables to propagate the new type.  On a PVH dom0
> > > this would likely involve shattering superpages in order to set the
> > > correct memory types.
> > > 
> > > Depending on how often and how random those system RAM changes are
> > > necessary this could also create contention on the p2m lock.
> > > 
> > > > makes me think of another possible option: Could Dom0 transfer ownership
> > > > of the RAM that wants mapping in the guest (remotely resembling
> > > > grant-transfer)? Would require the guest to have ballooned down enough
> > > > first, of course. (In both cases it would certainly need working out how
> > > > the conversion / transfer back could be made work safely and reasonably
> > > > cleanly.)
> > > 
> > > Maybe.  The fact the guest needs to balloon down that amount of memory
> > > seems weird to me, as from the guest PoV that mapped memory is
> > > MMIO-like and not system RAM.
> > 
> > I don’t like it either.  Furthermore, this would require changes to the
> > virtio-GPU driver in the guest, which I’d prefer to avoid.
> 
> IMO it would be helpful if you (or someone) could write the full
> specification of how VirtIO GPU is supposed to work right now (with
> the KVM model I assume?) as it would be a good starting point to
> provide suggestions about how to make it work (or adapt it) on Xen.
> 
> I don't think the high level layers on top of VirtIO GPU are relevant,
> but it's important to understand the protocol between the VirtIO GPU
> front and back ends.

virtio-GPU is part of the OASIS VirtIO standard [1].

[1]: https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html

> So far I only had scattered conversation about what's needed, but not
> a formal write-up of how this is supposed to work.

My understanding is that mapping GPU buffers into guests ("blob
resources" in virtio-GPU terms) is the only part of virtio-GPU that
didn't just work.  Furthermore, any solution that uses Linux's
kernel-mode GPU driver on the host will have the same requirements.
I don't consider writing a bespoke GPU driver that uses caller-allocated
buffers to be a reasonable solution that can support many GPU models.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

* Re: Design session notes: GPU acceleration in Xen
  2024-06-17  7:46           ` Roger Pau Monné
  2024-06-17 15:07             ` Demi Marie Obenour
@ 2024-06-17 20:46             ` Marek Marczykowski-Górecki
  2024-06-18  0:57               ` Demi Marie Obenour
  1 sibling, 1 reply; 26+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-06-17 20:46 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Demi Marie Obenour, Jan Beulich, Xenia Ragiadakou, Ray Huang,
	Xen developer discussion, Andrew Cooper

On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > In both cases, the device physical
> > addresses are identical to dom0’s physical addresses.
> 
> Yes, but a PV dom0 physical address space can be very scattered.
> 
> IIRC there's a hypercall to request physically contiguous memory for
> PV, but you don't want to be using that every time you allocate a
> buffer (not sure it would support the sizes needed by the GPU
> anyway).

Indeed that isn't going to fly. In older Qubes versions we had PV
sys-net with PCI passthrough for a network card. After some uptime it
was basically impossible to restart and still have enough contiguous
memory for a network driver, and there it was about _much_ smaller
buffers, like 2M or 4M. At least not without shutting down a lot more
things to free some more memory.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

* Re: Design session notes: GPU acceleration in Xen
  2024-06-17 20:46             ` Marek Marczykowski-Górecki
@ 2024-06-18  0:57               ` Demi Marie Obenour
  2024-06-18  6:33                 ` Christian König
  2024-06-18 14:43                 ` Roger Pau Monné
  0 siblings, 2 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-18  0:57 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Roger Pau Monné
  Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang,
	Xen developer discussion, Andrew Cooper,
	Direct Rendering Infrastructure development, Christian König,
	Qubes OS Development Mailing List

On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> > On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > > In both cases, the device physical
> > > addresses are identical to dom0’s physical addresses.
> > 
> > Yes, but a PV dom0 physical address space can be very scattered.
> > 
> > IIRC there's a hypercall to request physically contiguous memory for
> > PV, but you don't want to be using that every time you allocate a
> > buffer (not sure it would support the sizes needed by the GPU
> > anyway).
> 
> Indeed that isn't going to fly. In older Qubes versions we had PV
> sys-net with PCI passthrough for a network card. After some uptime it
> was basically impossible to restart and still have enough contiguous
> memory for a network driver, and there it was about _much_ smaller
> buffers, like 2M or 4M. At least not without shutting down a lot more
> things to free some more memory.

Ouch!  That makes me wonder if all GPU drivers actually need physically
contiguous buffers, or if it is (as I suspect) driver-specific.  CCing
Christian König who has mentioned issues in this area.

Given the recent progress on PVH dom0, is it reasonable to assume that
PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
doesn't need to worry about this problem on x86?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

* Re: Design session notes: GPU acceleration in Xen
  2024-06-18  0:57               ` Demi Marie Obenour
@ 2024-06-18  6:33                 ` Christian König
  2024-06-18 14:12                   ` Demi Marie Obenour
  2024-06-18 14:43                 ` Roger Pau Monné
  1 sibling, 1 reply; 26+ messages in thread
From: Christian König @ 2024-06-18  6:33 UTC (permalink / raw)
  To: Demi Marie Obenour, Marek Marczykowski-Górecki,
	Roger Pau Monné
  Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang,
	Xen developer discussion, Andrew Cooper,
	Direct Rendering Infrastructure development,
	Qubes OS Development Mailing List

Am 18.06.24 um 02:57 schrieb Demi Marie Obenour:
> On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki 
> wrote:
> > On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> >> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> >>> In both cases, the device physical
> >>> addresses are identical to dom0’s physical addresses.
> >>
> >> Yes, but a PV dom0 physical address space can be very scattered.
> >>
> >> IIRC there's a hypercall to request physically contiguous memory for
> >> PV, but you don't want to be using that every time you allocate a
> >> buffer (not sure it would support the sizes needed by the GPU
> >> anyway).
>
> > Indeed that isn't going to fly. In older Qubes versions we had PV
> > sys-net with PCI passthrough for a network card. After some uptime it
> > was basically impossible to restart and still have enough contiguous
> > memory for a network driver, and there it was about _much_ smaller
> > buffers, like 2M or 4M. At least not without shutting down a lot more
> > things to free some more memory.
>
> Ouch!  That makes me wonder if all GPU drivers actually need physically
> contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> Christian König who has mentioned issues in this area.

Well, GPUs don't need physically contiguous memory to function, but if they
only get 4k pages to work with, it means quite a large (up to 30%)
performance penalty.

So scattering memory like you described is probably a very bad idea if
you want any halfway decent performance.

Regards,
Christian.

>
> Given the recent progress on PVH dom0, is it reasonable to assume that
> PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
> doesn't need to worry about this problem on x86?



* Re: Design session notes: GPU acceleration in Xen
  2024-06-18  6:33                 ` Christian König
@ 2024-06-18 14:12                   ` Demi Marie Obenour
  2024-06-19  7:31                     ` Christian König
  0 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-18 14:12 UTC (permalink / raw)
  To: Christian König, Marek Marczykowski-Górecki,
	Roger Pau Monné
  Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang,
	Xen developer discussion, Andrew Cooper,
	Direct Rendering Infrastructure development,
	Qubes OS Development Mailing List

On Tue, Jun 18, 2024 at 08:33:38AM +0200, Christian König wrote:
> Am 18.06.24 um 02:57 schrieb Demi Marie Obenour:
> > On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> > wrote:
> > > On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> > >> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > >>> In both cases, the device physical
> > >>> addresses are identical to dom0’s physical addresses.
> > >>
> > >> Yes, but a PV dom0 physical address space can be very scattered.
> > >>
> > >> IIRC there's a hypercall to request physically contiguous memory for
> > >> PV, but you don't want to be using that every time you allocate a
> > >> buffer (not sure it would support the sizes needed by the GPU
> > >> anyway).
> > 
> > > Indeed that isn't going to fly. In older Qubes versions we had PV
> > > sys-net with PCI passthrough for a network card. After some uptime it
> > > was basically impossible to restart and still have enough contiguous
> > > memory for a network driver, and there it was about _much_ smaller
> > > buffers, like 2M or 4M. At least not without shutting down a lot more
> > > things to free some more memory.
> > 
> > Ouch!  That makes me wonder if all GPU drivers actually need physically
> > contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> > Christian König who has mentioned issues in this area.
> 
> Well, GPUs don't need physically contiguous memory to function, but if they
> only get 4k pages to work with, it means quite a large (up to 30%)
> performance penalty.

The status quo is "no GPU acceleration at all", so 70% of bare metal
performance would be amazing right now.  However, the implementation
should not preclude eliminating this performance penalty in the future.

What size pages do GPUs need for good performance?  Is it the same as
CPU huge pages?  PV dom0 doesn't get huge pages at all, but PVH and HVM
guests do, and the goal is to move away from PV guests as they have lots
of unrelated problems.

> So scattering memory like you described is probably a very bad idea if you
> want any halfway decent performance.

For an initial prototype a 30% performance penalty is acceptable, but
it's good to know that memory fragmentation needs to be avoided.

> Regards,
> Christian

Thanks for the prompt response!
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

* Re: Design session notes: GPU acceleration in Xen
  2024-06-18 14:12                   ` Demi Marie Obenour
@ 2024-06-19  7:31                     ` Christian König
  2024-06-19 16:56                       ` Alex Deucher
  0 siblings, 1 reply; 26+ messages in thread
From: Christian König @ 2024-06-19  7:31 UTC (permalink / raw)
  To: Demi Marie Obenour, Marek Marczykowski-Górecki,
	Roger Pau Monné, Pelloux-prayer, Pierre-eric
  Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang,
	Xen developer discussion, Andrew Cooper,
	Direct Rendering Infrastructure development,
	Qubes OS Development Mailing List

Am 18.06.24 um 16:12 schrieb Demi Marie Obenour:
> On Tue, Jun 18, 2024 at 08:33:38AM +0200, Christian König wrote:
> > Am 18.06.24 um 02:57 schrieb Demi Marie Obenour:
> >> On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> >> wrote:
> >>> On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> >>>> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> >>>>> In both cases, the device physical
> >>>>> addresses are identical to dom0’s physical addresses.
> >>>>
> >>>> Yes, but a PV dom0 physical address space can be very scattered.
> >>>>
> >>>> IIRC there's a hypercall to request physically contiguous memory for
> >>>> PV, but you don't want to be using that every time you allocate a
> >>>> buffer (not sure it would support the sizes needed by the GPU
> >>>> anyway).
> >>
> >>> Indeed that isn't going to fly. In older Qubes versions we had PV
> >>> sys-net with PCI passthrough for a network card. After some uptime it
> >>> was basically impossible to restart and still have enough contiguous
> >>> memory for a network driver, and there it was about _much_ smaller
> >>> buffers, like 2M or 4M. At least not without shutting down a lot more
> >>> things to free some more memory.
> >>
> >> Ouch!  That makes me wonder if all GPU drivers actually need physically
> >> contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> >> Christian König who has mentioned issues in this area.
>
> > Well, GPUs don't need physically contiguous memory to function, but if they
> > only get 4k pages to work with, it means quite a large (up to 30%)
> > performance penalty.
>
> The status quo is "no GPU acceleration at all", so 70% of bare metal
> performance would be amazing right now.

Well, AMD uses the native context approach in Xen, which delivers over
90% of bare metal performance.

Pierre-Eric can tell you more, but we certainly have GPU solutions in
production with Xen which would suffer greatly if we see the underlying
memory fragmented like this.

>   However, the implementation
> should not preclude eliminating this performance penalty in the future.
>
> What size pages do GPUs need for good performance?  Is it the same as
> CPU huge pages?

2MiB are usually sufficient.
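
A quick back-of-the-envelope sketch of why page size matters: the number of
mappings needed to cover one buffer at each page size. The 256 MiB buffer
size and the `pages_needed` helper are only assumptions for illustration;
this says nothing about where the measured penalty actually comes from on a
given GPU.

```python
KIB = 1024
MIB = 1024 * KIB

def pages_needed(buffer_bytes, page_bytes):
    """Number of pages (rounded up) needed to cover a buffer."""
    return -(-buffer_bytes // page_bytes)

buffer_bytes = 256 * MIB  # hypothetical buffer size
small = pages_needed(buffer_bytes, 4 * KIB)   # 4 KiB pages
large = pages_needed(buffer_bytes, 2 * MIB)   # 2 MiB pages
print(small, large, small // large)
```

For this example that is 65536 mappings with 4 KiB pages against 128 with
2 MiB pages, a factor of 512.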

Regards,
Christian.

>   PV dom0 doesn't get huge pages at all, but PVH and HVM
> guests do, and the goal is to move away from PV guests as they have lots
> of unrelated problems.
>
> > So scattering memory like you described is probably a very bad idea if you
> > want any halfway decent performance.
>
> For an initial prototype a 30% performance penalty is acceptable, but
> it's good to know that memory fragmentation needs to be avoided.
>
> > Regards,
> > Christian
>
> Thanks for the prompt response!



* Re: Design session notes: GPU acceleration in Xen
  2024-06-19  7:31                     ` Christian König
@ 2024-06-19 16:56                       ` Alex Deucher
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Deucher @ 2024-06-19 16:56 UTC (permalink / raw)
  To: Christian König
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki,
	Roger Pau Monné, Pelloux-prayer, Pierre-eric, Jan Beulich,
	Xenia Ragiadakou, Ray Huang, Xen developer discussion,
	Andrew Cooper, Direct Rendering Infrastructure development,
	Qubes OS Development Mailing List

On Wed, Jun 19, 2024 at 12:27 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 18.06.24 um 16:12 schrieb Demi Marie Obenour:
> > On Tue, Jun 18, 2024 at 08:33:38AM +0200, Christian König wrote:
> > > Am 18.06.24 um 02:57 schrieb Demi Marie Obenour:
> > >> On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> > >> wrote:
> > >>> On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> > >>>> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > >>>>> In both cases, the device physical
> > >>>>> addresses are identical to dom0’s physical addresses.
> > >>>>
> > >>>> Yes, but a PV dom0 physical address space can be very scattered.
> > >>>>
> > >>>> IIRC there's a hypercall to request physically contiguous memory for
> > >>>> PV, but you don't want to be using that every time you allocate a
> > >>>> buffer (not sure it would support the sizes needed by the GPU
> > >>>> anyway).
> > >>
> > >>> Indeed that isn't going to fly. In older Qubes versions we had PV
> > >>> sys-net with PCI passthrough for a network card. After some uptime it
> > >>> was basically impossible to restart and still have enough contiguous
> > >>> memory for a network driver, and there it was about _much_ smaller
> > >>> buffers, like 2M or 4M. At least not without shutting down a lot more
> > >>> things to free some more memory.
> > >>
> > >> Ouch!  That makes me wonder if all GPU drivers actually need physically
> > >> contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> > >> Christian König who has mentioned issues in this area.
> >
> > > Well, GPUs don't need physically contiguous memory to function, but if they
> > > only get 4k pages to work with, it means quite a large (up to 30%)
> > > performance penalty.
> >
> > The status quo is "no GPU acceleration at all", so 70% of bare metal
> > performance would be amazing right now.
>
> Well, AMD uses the native context approach in Xen, which delivers over
> 90% of bare metal performance.
>
> Pierre-Eric can tell you more, but we certainly have GPU solutions in
> production with Xen which would suffer greatly if we see the underlying
> memory fragmented like this.
>
> >   However, the implementation
> > should not preclude eliminating this performance penalty in the future.
> >
> > What size pages do GPUs need for good performance?  Is it the same as
> > CPU huge pages?
>
> 2MiB are usually sufficient.

Larger pages are helpful for both system memory and VRAM, but it's
more important for VRAM.

Alex

>
> Regards,
> Christian.
>
> >   PV dom0 doesn't get huge pages at all, but PVH and HVM
> > guests do, and the goal is to move away from PV guests as they have lots
> > of unrelated problems.
> >
> > > So scattering memory like you described is probably a very bad idea if you
> > > want any halfway decent performance.
> >
> > For an initial prototype a 30% performance penalty is acceptable, but
> > it's good to know that memory fragmentation needs to be avoided.
> >
> > > Regards,
> > > Christian
> >
> > Thanks for the prompt response!
>


* Re: Design session notes: GPU acceleration in Xen
  2024-06-18  0:57               ` Demi Marie Obenour
  2024-06-18  6:33                 ` Christian König
@ 2024-06-18 14:43                 ` Roger Pau Monné
  2024-06-18 14:56                   ` Demi Marie Obenour
  1 sibling, 1 reply; 26+ messages in thread
From: Roger Pau Monné @ 2024-06-18 14:43 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Marek Marczykowski-Górecki, Jan Beulich, Xenia Ragiadakou,
	Ray Huang, Xen developer discussion, Andrew Cooper,
	Direct Rendering Infrastructure development, Christian König,
	Qubes OS Development Mailing List

On Mon, Jun 17, 2024 at 08:57:14PM -0400, Demi Marie Obenour wrote:
> Given the recent progress on PVH dom0, is it reasonable to assume that
> PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
> doesn't need to worry about this problem on x86?

PVH dom0 will only be ready (whatever ready means in your use-case)
when people test and fix the issues, otherwise it would stay in the
same limbo it's currently in.

I guess the main blocker for Qubes is the lack of PCI passthrough
support in order to test it more aggressively?

Thanks, Roger.


* Re: Design session notes: GPU acceleration in Xen
  2024-06-18 14:43                 ` Roger Pau Monné
@ 2024-06-18 14:56                   ` Demi Marie Obenour
  0 siblings, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-18 14:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Marek Marczykowski-Górecki, Jan Beulich, Xenia Ragiadakou,
	Ray Huang, Xen developer discussion, Andrew Cooper,
	Direct Rendering Infrastructure development, Christian König,
	Qubes OS Development Mailing List

On Tue, Jun 18, 2024 at 04:43:50PM +0200, Roger Pau Monné wrote:
> On Mon, Jun 17, 2024 at 08:57:14PM -0400, Demi Marie Obenour wrote:
> > Given the recent progress on PVH dom0, is it reasonable to assume that
> > PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
> > doesn't need to worry about this problem on x86?
> 
> PVH dom0 will only be ready (whatever ready means in your use-case)
> when people test and fix the issues, otherwise it would stay in the
> same limbo it's currently in.
> 
> I guess the main blocker for Qubes is the lack of PCI passthrough
> support in order to test it more aggressively?

I suspect so, though Marek would need to confirm.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

* Re: Design session notes: GPU acceleration in Xen
  2024-06-17  0:38         ` Demi Marie Obenour
  2024-06-17  7:46           ` Roger Pau Monné
@ 2024-06-17  9:13           ` Jan Beulich
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2024-06-17  9:13 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Roger Pau Monné

On 17.06.2024 02:38, Demi Marie Obenour wrote:
> On Fri, Jun 14, 2024 at 10:39:37AM +0200, Roger Pau Monné wrote:
>> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
>>> On 14.06.2024 09:21, Roger Pau Monné wrote:
>>>> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
>>>>> On 13.06.2024 20:43, Demi Marie Obenour wrote:
>>>>>> GPU acceleration requires that pageable host memory be able to be mapped
>>>>>> into a guest.
>>>>>
>>>>> I'm sure it was explained in the session, which sadly I couldn't attend.
>>>>> I've been asking Ray and Xenia the same before, but I'm afraid it still
>>>>> hasn't become clear to me why this is a _requirement_. After all that's
>>>>> against what we're doing elsewhere (i.e. so far it has always been
>>>>> guest memory that's mapped in the host). I can appreciate that it might
>>>>> be more difficult to implement, but avoiding to violate this fundamental
>>>>> (kind of) rule might be worth the price (and would avoid other
>>>>> complexities, of which there may be lurking more than what you enumerate
>>>>> below).
>>>>
>>>> My limited understanding (please someone correct me if wrong) is that
>>>> the GPU buffer (or context I think it's also called?) is always
>>>> allocated from dom0 (the owner of the GPU).  The underlying memory
>>>> addresses of such a buffer need to be mapped into the guest.  The
>>>> buffer backing memory might be GPU MMIO from the device BAR(s) or
>>>> system RAM, and such buffer can be paged by the dom0 kernel at any
>>>> time (iow: changing the backing memory from MMIO to RAM or vice
>>>> versa).  Also, the buffer must be contiguous in physical address
>>>> space.
>>>
>>> This last one in particular would of course be a severe restriction.
>>> Yet: There's an IOMMU involved, isn't there?
>>
>> Yup, IIRC that's why Ray said it was much easier for them to
>> support VirtIO GPUs from a PVH dom0 rather than classic PV one.
>>
>> It might be easier to implement from a classic PV dom0 if there's
>> pv-iommu support, so that dom0 can create its own contiguous memory
>> buffers from the device PoV.
> 
> What makes PVH an improvement here?  I thought PV dom0 uses an identity
> mapping for the IOMMU,

True, but see how Roger mentioned PV IOMMU (which would allow a domain
to move away from this identity mapping).

Jan

> while a PVH dom0 uses an IOMMU that mirrors the
> dom0 second-stage page tables.  In both cases, the device physical
> addresses are identical to dom0’s physical addresses.
> 
> PV is terrible for many reasons, so I’m okay with focusing on PVH dom0,
> but I’d like to know why there is a difference.



* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  8:12     ` Jan Beulich
  2024-06-14  8:39       ` Roger Pau Monné
@ 2024-06-14 16:44       ` Demi Marie Obenour
  2024-06-17  9:07         ` Jan Beulich
  1 sibling, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 16:44 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper

On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> On 14.06.2024 09:21, Roger Pau Monné wrote:
> > On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> >> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> >>> GPU acceleration requires that pageable host memory be able to be mapped
> >>> into a guest.
> >>
> >> I'm sure it was explained in the session, which sadly I couldn't attend.
> >> I've been asking Ray and Xenia the same before, but I'm afraid it still
> >> hasn't become clear to me why this is a _requirement_. After all that's
> >> against what we're doing elsewhere (i.e. so far it has always been
> >> guest memory that's mapped in the host). I can appreciate that it might
> >> be more difficult to implement, but avoiding to violate this fundamental
> >> (kind of) rule might be worth the price (and would avoid other
> >> complexities, of which there may be lurking more than what you enumerate
> >> below).
> > 
> > My limited understanding (please someone correct me if wrong) is that
> > the GPU buffer (or context I think it's also called?) is always
> > > allocated from dom0 (the owner of the GPU).  The underlying memory
> > > addresses of such a buffer need to be mapped into the guest.  The
> > buffer backing memory might be GPU MMIO from the device BAR(s) or
> > system RAM, and such buffer can be paged by the dom0 kernel at any
> > time (iow: changing the backing memory from MMIO to RAM or vice
> > versa).  Also, the buffer must be contiguous in physical address
> > space.
> 
> This last one in particular would of course be a severe restriction.
> Yet: There's an IOMMU involved, isn't there?

On x86 there is.  On Arm there might or might not be.  There are
non-embedded systems (such as Apple silicon) where the GPU is not behind
an IOMMU, for performance reasons IIUC.

> > I'm not sure it's possible to ensure that when using system RAM such
> > memory comes from the guest rather than the host, as it would likely
> > require some very intrusive hooks into the kernel logic, and
> > negotiation with the guest to allocate the requested amount of
> > memory and hand it over to dom0.  If the maximum size of the buffer is
> > known in advance maybe dom0 can negotiate with the guest to allocate
> > such a region and grant it access to dom0 at driver attachment time.
> 
> Besides the thought of transiently converting RAM to kind-of-MMIO, this
> makes me think of another possible option: Could Dom0 transfer ownership
> of the RAM that wants mapping in the guest (remotely resembling
> grant-transfer)? Would require the guest to have ballooned down enough
> first, of course. (In both cases it would certainly need working out how
> the conversion / transfer back could be made work safely and reasonably
> cleanly.)
> 
> Jan

The kernel driver needs to be able to reclaim the memory at any time.
My understanding is that this is used to migrate memory between VRAM and
system RAM.  It might also be used for other purposes.
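
To make the reclaim requirement concrete, here is a toy model of the flow
sketched in the design notes: the host revokes the guest mapping before
moving a buffer, and the guest's next access faults, gets remapped, and is
retried. Everything here (`ToyHost`, its methods, the `"blob0"` name, the
addresses) is hypothetical and only stands in for the roles of the MMU
notifier and the IOREQ "retry" flag; it is not a real Xen or Linux API.

```python
class ToyHost:
    """Hypothetical stand-in for dom0 plus the device model; not a real API."""

    def __init__(self):
        self.backing = {}    # buffer id -> current backing ("VRAM" or "RAM")
        self.guest_map = {}  # guest address -> buffer id

    def map_into_guest(self, gaddr, buf):
        self.guest_map[gaddr] = buf

    def migrate(self, buf, new_backing):
        # The driver reclaims the buffer: revoke every guest mapping first
        # (the role an MMU notifier plays), then change the backing memory.
        for gaddr in [g for g, b in self.guest_map.items() if b == buf]:
            del self.guest_map[gaddr]
        self.backing[buf] = new_backing

    def guest_access(self, gaddr, buf):
        # Returns the backing the guest ends up touching, and whether a
        # fault had to be handled (and the access retried) first.
        faulted = gaddr not in self.guest_map
        if faulted:
            self.map_into_guest(gaddr, buf)  # handle the fault, then retry
        return self.backing[self.guest_map[gaddr]], faulted


host = ToyHost()
host.backing["blob0"] = "VRAM"
host.map_into_guest(0x1000, "blob0")

first = host.guest_access(0x1000, "blob0")   # mapping still valid
host.migrate("blob0", "RAM")                 # host reclaims and moves it
second = host.guest_access(0x1000, "blob0")  # faults, remapped, retried
third = host.guest_access(0x1000, "blob0")   # mapping valid again
print(first, second, third)
```

The point of the model is only that the guest never sees a stale mapping:
after the migration, the first access goes through the fault path.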

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Design session notes: GPU acceleration in Xen
  2024-06-14 16:44       ` Demi Marie Obenour
@ 2024-06-17  9:07         ` Jan Beulich
  2024-06-17 15:17           ` Demi Marie Obenour
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2024-06-17  9:07 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Roger Pau Monné

On 14.06.2024 18:44, Demi Marie Obenour wrote:
> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
>> On 14.06.2024 09:21, Roger Pau Monné wrote:
>>> I'm not sure it's possible to ensure that when using system RAM such
>>> memory comes from the guest rather than the host, as it would likely
>>> require some very intrusive hooks into the kernel logic, and
>>> negotiation with the guest to allocate the requested amount of
>>> memory and hand it over to dom0.  If the maximum size of the buffer is
>>> known in advance maybe dom0 can negotiate with the guest to allocate
>>> such a region and grant it access to dom0 at driver attachment time.
>>
>> Besides the thought of transiently converting RAM to kind-of-MMIO, this
>> makes me think of another possible option: Could Dom0 transfer ownership
>> of the RAM that wants mapping in the guest (remotely resembling
>> grant-transfer)? Would require the guest to have ballooned down enough
>> first, of course. (In both cases it would certainly need working out how
>> the conversion / transfer back could be made work safely and reasonably
>> cleanly.)
> 
> The kernel driver needs to be able to reclaim the memory at any time.
> My understanding is that this is used to migrate memory between VRAM and
> system RAM.  It might also be used for other purposes.

Except: How would the kernel driver reclaim the memory when it's mapped
by a DomU?

Jan



* Re: Design session notes: GPU acceleration in Xen
  2024-06-17  9:07         ` Jan Beulich
@ 2024-06-17 15:17           ` Demi Marie Obenour
  2024-06-17 15:39             ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-17 15:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Roger Pau Monné


On Mon, Jun 17, 2024 at 11:07:54AM +0200, Jan Beulich wrote:
> On 14.06.2024 18:44, Demi Marie Obenour wrote:
> > On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> >> On 14.06.2024 09:21, Roger Pau Monné wrote:
> >>> I'm not sure it's possible to ensure that when using system RAM such
> >>> memory comes from the guest rather than the host, as it would likely
> >>> require some very intrusive hooks into the kernel logic, and
> >>> negotiation with the guest to allocate the requested amount of
> >>> memory and hand it over to dom0.  If the maximum size of the buffer is
> >>> known in advance maybe dom0 can negotiate with the guest to allocate
> >>> such a region and grant it access to dom0 at driver attachment time.
> >>
> >> Besides the thought of transiently converting RAM to kind-of-MMIO, this
> >> makes me think of another possible option: Could Dom0 transfer ownership
> >> of the RAM that wants mapping in the guest (remotely resembling
> >> grant-transfer)? Would require the guest to have ballooned down enough
> >> first, of course. (In both cases it would certainly need working out how
> >> the conversion / transfer back could be made work safely and reasonably
> >> cleanly.)
> > 
> > The kernel driver needs to be able to reclaim the memory at any time.
> > My understanding is that this is used to migrate memory between VRAM and
> > system RAM.  It might also be used for other purposes.
> 
> Except: How would the kernel driver reclaim the memory when it's mapped
> by a DomU?

The Xen driver in dom0 will register for MMU notifier callbacks.  When
the kernel driver reclaims the memory, the Xen driver will be notified,
and it will issue a hypercall that tells Xen to remove the memory from
the DomU's address space.  Subsequent accesses to the pages will trigger
a stage 2 translation fault that is handled by an IOREQ server.

For I/O memory, this is already possible via XEN_DOMCTL_memory_mapping.
The proposal in this thread is to make this possible for system RAM as
well.
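
As a rough illustration of the revoke-then-fault flow described above,
here is a minimal userspace sketch in C.  It is only a toy model: the
`toy_*` names are hypothetical and are not existing Xen or Linux
interfaces, and the "stage 2 table" is just an array of present bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8          /* size of the toy guest-physical range */

/* Toy stand-in for the DomU's stage 2 tables: one present bit per page. */
struct toy_stage2 {
    bool present[TOY_NR_PAGES];
    unsigned int faults;        /* faults forwarded to the IOREQ server */
};

/* Stand-in for the hypercall that maps host pages into the guest. */
static void toy_map_range(struct toy_stage2 *s2, size_t first, size_t nr)
{
    assert(first + nr <= TOY_NR_PAGES);
    for (size_t i = first; i < first + nr; i++)
        s2->present[i] = true;
}

/*
 * Stand-in for what the dom0 Xen driver would do from its MMU notifier
 * callback: have Xen remove the range from the guest before the kernel
 * driver reuses the memory.
 */
static void toy_mmu_notifier_invalidate(struct toy_stage2 *s2,
                                        size_t first, size_t nr)
{
    assert(first + nr <= TOY_NR_PAGES);
    for (size_t i = first; i < first + nr; i++)
        s2->present[i] = false;
}

/*
 * Guest access: returns true if the access hits a mapping, false if it
 * takes a stage 2 fault (which would be forwarded to the IOREQ server).
 */
static bool toy_guest_access(struct toy_stage2 *s2, size_t page)
{
    assert(page < TOY_NR_PAGES);
    if (s2->present[page])
        return true;
    s2->faults++;
    return false;
}
```

Mapping pages 2..4, invalidating them, and then touching page 3 makes
`toy_guest_access()` return false and bumps the fault counter, which is
the point at which the IOREQ server would get involved.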
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Design session notes: GPU acceleration in Xen
  2024-06-17 15:17           ` Demi Marie Obenour
@ 2024-06-17 15:39             ` Jan Beulich
  2024-06-17 16:02               ` Demi Marie Obenour
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2024-06-17 15:39 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Roger Pau Monné

On 17.06.2024 17:17, Demi Marie Obenour wrote:
> On Mon, Jun 17, 2024 at 11:07:54AM +0200, Jan Beulich wrote:
>> On 14.06.2024 18:44, Demi Marie Obenour wrote:
>>> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
>>>> On 14.06.2024 09:21, Roger Pau Monné wrote:
>>>>> I'm not sure it's possible to ensure that when using system RAM such
>>>>> memory comes from the guest rather than the host, as it would likely
>>>>> require some very intrusive hooks into the kernel logic, and
>>>>> negotiation with the guest to allocate the requested amount of
>>>>> memory and hand it over to dom0.  If the maximum size of the buffer is
>>>>> known in advance maybe dom0 can negotiate with the guest to allocate
>>>>> such a region and grant it access to dom0 at driver attachment time.
>>>>
>>>> Besides the thought of transiently converting RAM to kind-of-MMIO, this
>>>> makes me think of another possible option: Could Dom0 transfer ownership
>>>> of the RAM that wants mapping in the guest (remotely resembling
>>>> grant-transfer)? Would require the guest to have ballooned down enough
>>>> first, of course. (In both cases it would certainly need working out how
>>>> the conversion / transfer back could be made work safely and reasonably
>>>> cleanly.)
>>>
>>> The kernel driver needs to be able to reclaim the memory at any time.
>>> My understanding is that this is used to migrate memory between VRAM and
>>> system RAM.  It might also be used for other purposes.
>>
>> Except: How would the kernel driver reclaim the memory when it's mapped
>> by a DomU?
> 
> The Xen driver in dom0 will register for MMU notifier callbacks.  When
> the kernel driver reclaims the memory, the Xen driver will be notified,
> and it will issue a hypercall that tells Xen to remove the memory from
> the DomU's address space.  Subsequent accesses to the pages will trigger
> a stage 2 translation fault that is handled by an IOREQ server.

And such an ioreq server, which I assume isn't going to run in the Dom0
kernel, will then also need keeping up-to-date on holes in the (virtual)
BAR. Oh well ...

Jan



* Re: Design session notes: GPU acceleration in Xen
  2024-06-17 15:39             ` Jan Beulich
@ 2024-06-17 16:02               ` Demi Marie Obenour
  0 siblings, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-17 16:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Roger Pau Monné


On Mon, Jun 17, 2024 at 05:39:23PM +0200, Jan Beulich wrote:
> On 17.06.2024 17:17, Demi Marie Obenour wrote:
> > On Mon, Jun 17, 2024 at 11:07:54AM +0200, Jan Beulich wrote:
> >> On 14.06.2024 18:44, Demi Marie Obenour wrote:
> >>> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> >>>> On 14.06.2024 09:21, Roger Pau Monné wrote:
> >>>>> I'm not sure it's possible to ensure that when using system RAM such
> >>>>> memory comes from the guest rather than the host, as it would likely
> >>>>> require some very intrusive hooks into the kernel logic, and
> >>>>> negotiation with the guest to allocate the requested amount of
> >>>>> memory and hand it over to dom0.  If the maximum size of the buffer is
> >>>>> known in advance maybe dom0 can negotiate with the guest to allocate
> >>>>> such a region and grant it access to dom0 at driver attachment time.
> >>>>
> >>>> Besides the thought of transiently converting RAM to kind-of-MMIO, this
> >>>> makes me think of another possible option: Could Dom0 transfer ownership
> >>>> of the RAM that wants mapping in the guest (remotely resembling
> >>>> grant-transfer)? Would require the guest to have ballooned down enough
> >>>> first, of course. (In both cases it would certainly need working out how
> >>>> the conversion / transfer back could be made work safely and reasonably
> >>>> cleanly.)
> >>>
> >>> The kernel driver needs to be able to reclaim the memory at any time.
> >>> My understanding is that this is used to migrate memory between VRAM and
> >>> system RAM.  It might also be used for other purposes.
> >>
> >> Except: How would the kernel driver reclaim the memory when it's mapped
> >> by a DomU?
> > 
> > The Xen driver in dom0 will register for MMU notifier callbacks.  When
> > the kernel driver reclaims the memory, the Xen driver will be notified,
> > and it will issue a hypercall that tells Xen to remove the memory from
> > the DomU's address space.  Subsequent accesses to the pages will trigger
> > a stage 2 translation fault that is handled by an IOREQ server.
> 
> And such an ioreq server, which I assume isn't going to run in the Dom0
> kernel, will then also need keeping up-to-date on holes in the (virtual)
> BAR. Oh well ...

My initial plan was that it _would_ run in the dom0 kernel, because this
results in a cleaner userspace API.  Ultimately I think it is best to go
with whichever approach keeps the kernel code simpler, but I'm not sure.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  7:21   ` Roger Pau Monné
  2024-06-14  8:12     ` Jan Beulich
@ 2024-06-14 17:55     ` Demi Marie Obenour
  1 sibling, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 17:55 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper


On Fri, Jun 14, 2024 at 09:21:56AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> > On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > > GPU acceleration requires that pageable host memory be able to be mapped
> > > into a guest.
> > 
> > I'm sure it was explained in the session, which sadly I couldn't attend.
> > I've been asking Ray and Xenia the same before, but I'm afraid it still
> > hasn't become clear to me why this is a _requirement_. After all that's
> > against what we're doing elsewhere (i.e. so far it has always been
> > guest memory that's mapped in the host). I can appreciate that it might
> > be more difficult to implement, but avoiding to violate this fundamental
> > (kind of) rule might be worth the price (and would avoid other
> > complexities, of which there may be lurking more than what you enumerate
> > below).
> 
> My limited understanding (please someone correct me if wrong) is that
> the GPU buffer (or context I think it's also called?) is always
> allocated from dom0 (the owner of the GPU).

A GPU context is a GPU address space.  It's the GPU equivalent of a CPU
process.  I don't believe that the same context can be used by more than
one userspace process (though I could be wrong), but the same userspace
process can create and use as many contexts as it wants.

> The underlying memory
> addresses of such buffer needs to be mapped into the guest.  The
> buffer backing memory might be GPU MMIO from the device BAR(s) or
> system RAM, and such buffer can be paged by the dom0 kernel at any
> time (iow: changing the backing memory from MMIO to RAM or vice
> versa).  Also, the buffer must be contiguous in physical address
> space.
> 
> I'm not sure it's possible to ensure that when using system RAM such
> memory comes from the guest rather than the host, as it would likely
> require some very intrusive hooks into the kernel logic, and
> negotiation with the guest to allocate the requested amount of
> memory and hand it over to dom0.  If the maximum size of the buffer is
> known in advance maybe dom0 can negotiate with the guest to allocate
> such a region and grant it access to dom0 at driver attachment time.

I don't think there is a useful maximum size known.  There may be a
limit, but it would be around 4GiB or more, which is far too high to
reserve physical memory for up front.

> One aspect that I'm lacking clarity is better understanding of how the
> process of allocating and assigning a GPU buffer to a guest is
> performed (I think this is the key to how GPU VirtIO native contexts
> work?).

The buffer is allocated by the GPU driver in response to an ioctl() made
by the userspace server process.  If the buffer needs to be accessed by
the guest CPU (not all do), it is mapped into part of an emulated PCI
BAR for access by the guest.  This mailing list thread is about making
that possible.
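
For illustration only, the following C sketch shows the kind of sanity
checks a registration request could go through before anything is
mapped.  The struct is the prototype from the design session notes; the
field meanings (userspace address, size in bytes, guest-physical
address), the 4 KiB page size, and `toy_request_is_sane()` are
assumptions made for this sketch, not an agreed interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u     /* assumed page size for this sketch */

/* Prototype as posted in the design session notes. */
struct xen_linux_register_memory {
    uint64_t pointer;
    uint64_t size;
    uint64_t gpa;
    uint32_t id;
    uint32_t guest_domid;
};

/* Three 64-bit fields plus two 32-bit fields: no padding expected. */
static_assert(sizeof(struct xen_linux_register_memory) == 32,
              "unexpected padding in xen_linux_register_memory");

/* Hypothetical checks: nonzero size, page alignment, no wraparound. */
static bool toy_request_is_sane(const struct xen_linux_register_memory *req)
{
    const uint64_t mask = TOY_PAGE_SIZE - 1;

    if (req->size == 0)
        return false;
    if ((req->pointer | req->size | req->gpa) & mask)
        return false;
    if (req->pointer + req->size < req->pointer)
        return false;               /* source range wraps around */
    if (req->gpa + req->size < req->gpa)
        return false;               /* guest range wraps around */
    return true;
}
```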

> Another question I have: are guests expected to have a single GPU
> buffer, or can they have multiple GPU buffers simultaneously
> allocated?

I believe there is only one emulated BAR, but this is very large (GiBs)
and sparsely populated.  There can be many GPU buffers mapped into the
BAR.
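
A sparsely populated BAR means that whoever handles the faults has to
tell "offset inside a buffer" apart from "offset inside a hole".  One
possible way to do that is sketched below in C, assuming the mapped
buffers are kept in an array sorted by offset; `toy_bar_region` and
`toy_bar_lookup()` are hypothetical names for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One GPU buffer mapped at some offset inside the emulated BAR. */
struct toy_bar_region {
    uint64_t offset;    /* start offset within the BAR, in bytes */
    uint64_t size;      /* length in bytes, nonzero */
    uint32_t id;        /* hypothetical buffer identifier */
};

/*
 * Find the region covering `off`, or NULL if `off` falls into a hole.
 * `regions` must be sorted by offset and must not overlap.
 */
static const struct toy_bar_region *
toy_bar_lookup(const struct toy_bar_region *regions, size_t nr, uint64_t off)
{
    size_t lo = 0, hi = nr;

    /* Binary search: count the regions starting at or before `off`. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (regions[mid].offset <= off)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;                    /* before the first region */

    const struct toy_bar_region *r = &regions[lo - 1];

    assert(r->offset <= off);
    /* Compare via subtraction so offset + size cannot overflow. */
    return (off - r->offset < r->size) ? r : NULL;
}
```

With two regions at offsets 0 and 1 GiB, a lookup just past the end of
the first one returns NULL, i.e. the access hit a hole.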
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  6:38 ` Jan Beulich
  2024-06-14  7:21   ` Roger Pau Monné
@ 2024-06-14 16:35   ` Demi Marie Obenour
  2024-06-17  9:05     ` Jan Beulich
  2024-06-14 20:56   ` Demi Marie Obenour
  2 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 16:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, dri-devel


On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > GPU acceleration requires that pageable host memory be able to be mapped
> > into a guest.
> 
> I'm sure it was explained in the session, which sadly I couldn't attend.
> I've been asking Ray and Xenia the same before, but I'm afraid it still
> hasn't become clear to me why this is a _requirement_. After all that's
> against what we're doing elsewhere (i.e. so far it has always been
> guest memory that's mapped in the host). I can appreciate that it might
> be more difficult to implement, but avoiding to violate this fundamental
> (kind of) rule might be worth the price (and would avoid other
> complexities, of which there may be lurking more than what you enumerate
> below).

My understanding is:

- Discrete GPUs require the memory to be VRAM, rather than system RAM.
- Various APIs require dmabufs.  Xen's support for dmabufs doesn't work
  with PV dom0.
- The existing virtio-GPU protocol (which is not Xen-specific and so
  gets more testing and has broader support than anything that _is_
  Xen-specific) requires backend allocation for native contexts.
- There might be other issues (caching?  memory management?) involved.

I'm CCing dri-devel in hopes of getting a better response.

> >  This requires changes to all of the Xen hypervisor, Linux
> > kernel, and userspace device model.
> > 
> > ### Goals
> > 
> >  - Allow any userspace pages to be mapped into a guest.
> >  - Support deprivileged operation: this API must not be usable for privilege escalation.
> >  - Use MMU notifiers to ensure safety with respect to use-after-free.
> > 
> > ### Hypervisor changes
> > 
> > There are at least two Xen changes required:
> > 
> > 1. Add a new flag to IOREQ that means "retry this instruction".
> > 
> >    An IOREQ server can set this flag after having successfully handled a
> >    page fault.  It is expected that the IOREQ server has successfully
> >    mapped a page into the guest at the location of the fault.
> >    Otherwise, the same fault will likely happen again.
> 
> Were there any thoughts on how to prevent this becoming an infinite loop?
> I.e. how to (a) guarantee forward progress in the guest and (b) deal with
> misbehaving IOREQ servers?

Guaranteeing forward progress is up to the IOREQ server.  If the IOREQ
server misbehaves, an infinite loop is possible, but the CPU time used
by it should be charged to the IOREQ server, so this isn't a
vulnerability.
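
To make the forward-progress argument concrete, here is a small C model
of a well-behaved server.  Everything in it is hypothetical: the thread
only proposes a "retry this instruction" flag, so `TOY_IOREQ_FAIL` and
the `toy_*` helpers are assumptions made for the sketch.  The point is
that the server asks for a retry only after the mapping is in place, so
the retried access cannot fault on the same page again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes an IOREQ server could report for a fault. */
enum toy_ioreq_result {
    TOY_IOREQ_RETRY,    /* page was mapped; re-execute the instruction */
    TOY_IOREQ_FAIL,     /* nothing could be mapped; do not retry */
};

struct toy_page {
    bool mapped;        /* present in the guest's stage 2 tables */
    bool backing_ready; /* host memory is available to map */
};

/* Well-behaved handler: request a retry only once the page is mapped. */
static enum toy_ioreq_result toy_handle_fault(struct toy_page *pg)
{
    assert(!pg->mapped);        /* a mapped page should not fault */
    if (!pg->backing_ready)
        return TOY_IOREQ_FAIL;
    pg->mapped = true;
    return TOY_IOREQ_RETRY;
}

/*
 * Model of the guest executing one instruction that touches `pg`.
 * Returns the number of faults taken, or -1 if the server gave up.
 */
static int toy_run_instruction(struct toy_page *pg)
{
    int faults = 0;

    while (!pg->mapped) {
        faults++;
        if (toy_handle_fault(pg) != TOY_IOREQ_RETRY)
            return -1;
    }
    return faults;
}
```

A misbehaving server would correspond to a handler that returns
`TOY_IOREQ_RETRY` without setting `mapped`, which makes the loop in
`toy_run_instruction()` spin forever.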

> > 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
> >    just IOMEM.  Mappings made with `XEN_DOMCTL_memory_mapping` are
> >    guaranteed to be able to be successfully revoked with
> >    `XEN_DOMCTL_memory_mapping`, so all operations that would create
> >    extra references to the mapped memory must be forbidden.  These
> >    include, but may not be limited to:
> > 
> >    1. Granting the pages to the same or other domains.
> >    2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
> >    3. Another domain accessing the pages using the foreign memory APIs,
> >       unless it is privileged over the domain that owns the pages.
> 
> All of which may call for actually converting the memory to kind-of-MMIO,
> with a means to later convert it back.

Would this support the case where the mapping domain is not fully
privileged, and where it might be a PV guest?

> Jan
> 
> >    Open question: what if the other domain goes away?  Ideally,
> >    unmapping would (vacuously) succeed in this case.  Qubes OS doesn't
> >    care about domid reuse but others might.
> > 
> > ### Kernel changes
> > 
> > Linux will add support for mapping userspace memory into an emulated PCI
> > BAR.  This requires Linux to automatically revoke access when needed.
> > 
> > There will be an IOREQ server that handles page faults.  The discussion
> > assumed that this handling will happen in kernel mode, but if handling
> > in user mode is simpler that is also an option.
> > 
> > There is no async #PF in Xen (yet), so the entire vCPU will be blocked
> > while the fault is handled.  This is not great for performance, but
> > correctness comes first.
> > 
> > There will be a new kernel ioctl to perform the mapping.  A possible C
> > prototype (presented at design session, but not discussed there):
> > 
> >     struct xen_linux_register_memory {
> >         uint64_t pointer;
> >         uint64_t size;
> >         uint64_t gpa;
> >         uint32_t id;
> >         uint32_t guest_domid;
> >     };
> 

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



* Re: Design session notes: GPU acceleration in Xen
  2024-06-14 16:35   ` Demi Marie Obenour
@ 2024-06-17  9:05     ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2024-06-17  9:05 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, dri-devel, Daniel Vetter,
	David Airlie

On 14.06.2024 18:35, Demi Marie Obenour wrote:
> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
>> On 13.06.2024 20:43, Demi Marie Obenour wrote:
>>> 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
>>>    just IOMEM.  Mappings made with `XEN_DOMCTL_memory_mapping` are
>>>    guaranteed to be able to be successfully revoked with
>>>    `XEN_DOMCTL_memory_mapping`, so all operations that would create
>>>    extra references to the mapped memory must be forbidden.  These
>>>    include, but may not be limited to:
>>>
>>>    1. Granting the pages to the same or other domains.
>>>    2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
>>>    3. Another domain accessing the pages using the foreign memory APIs,
>>>       unless it is privileged over the domain that owns the pages.
>>
>> All of which may call for actually converting the memory to kind-of-MMIO,
>> with a means to later convert it back.
> 
> Would this support the case where the mapping domain is not fully
> privileged, and where it might be a PV guest?

I suppose that should be a goal.

Jan



* Re: Design session notes: GPU acceleration in Xen
  2024-06-14  6:38 ` Jan Beulich
  2024-06-14  7:21   ` Roger Pau Monné
  2024-06-14 16:35   ` Demi Marie Obenour
@ 2024-06-14 20:56   ` Demi Marie Obenour
  2 siblings, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 20:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
	Xen developer discussion, Andrew Cooper, Roger Pau Monné,
	Direct Rendering Infrastructure development, Daniel Vetter,
	David Airlie, Rob Clark


On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > GPU acceleration requires that pageable host memory be able to be mapped
> > into a guest.
> 
> I'm sure it was explained in the session, which sadly I couldn't attend.
> I've been asking Ray and Xenia the same before, but I'm afraid it still
> hasn't become clear to me why this is a _requirement_. After all that's
> against what we're doing elsewhere (i.e. so far it has always been
> guest memory that's mapped in the host). I can appreciate that it might
> be more difficult to implement, but avoiding to violate this fundamental
> (kind of) rule might be worth the price (and would avoid other
> complexities, of which there may be lurking more than what you enumerate
> below).

The GPU driver knows how to allocate buffers that are usable by the GPU.
On a discrete GPU, these buffers will generally be in VRAM, rather than
in system RAM, because access to system RAM requires going through the
PCI bus (slow).  However, VRAM is a limited resource, so the driver will
migrate pages between VRAM and system RAM as needed.  During the
migration, a guest that tries to access the pages must block until the
migration is complete.
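
The migration behaviour can be pictured as a small state machine.  The
C sketch below is a toy model under stated assumptions: the `toy_*`
names are hypothetical, and "blocking" is represented by the fault
handler simply reporting that the access cannot proceed yet.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_placement { TOY_IN_VRAM, TOY_IN_SYSRAM, TOY_MIGRATING };

struct toy_buffer {
    enum toy_placement where;   /* current placement */
    enum toy_placement target;  /* destination while migrating */
    bool guest_mapped;          /* guest currently has a mapping */
};

/* Start a migration: the guest mapping is revoked before pages move. */
static void toy_migrate_begin(struct toy_buffer *b, enum toy_placement target)
{
    assert(b->where != TOY_MIGRATING && target != TOY_MIGRATING);
    b->guest_mapped = false;
    b->target = target;
    b->where = TOY_MIGRATING;
}

/* Finish a migration: the buffer now lives at its target placement. */
static void toy_migrate_end(struct toy_buffer *b)
{
    assert(b->where == TOY_MIGRATING);
    b->where = b->target;
}

/*
 * Guest fault on the buffer.  Returns true if the mapping was restored
 * and the access can proceed, false if the vCPU has to keep waiting.
 */
static bool toy_guest_fault(struct toy_buffer *b)
{
    if (b->where == TOY_MIGRATING)
        return false;
    b->guest_mapped = true;
    return true;
}
```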

Some GPU drivers support accessing externally provided memory.  This is
called "userptr", and is supported by i915 and amdgpu.  However, it
appears that some other drivers (such as MSM) do not support it, and
since GPUs with VRAM need to be supported anyway, Xen still needs to
support GPU driver-allocated memory.

I also CCd dri-devel@lists.freedesktop.org and the general GPU driver
maintainers in Linux in case they can give a better answer, as well as
Rob Clark who invented native contexts.

> >  This requires changes to all of the Xen hypervisor, Linux
> > kernel, and userspace device model.
> > 
> > ### Goals
> > 
> >  - Allow any userspace pages to be mapped into a guest.
> >  - Support deprivileged operation: this API must not be usable for privilege escalation.
> >  - Use MMU notifiers to ensure safety with respect to use-after-free.
> > 
> > ### Hypervisor changes
> > 
> > There are at least two Xen changes required:
> > 
> > 1. Add a new flag to IOREQ that means "retry this instruction".
> > 
> >    An IOREQ server can set this flag after having successfully handled a
> >    page fault.  It is expected that the IOREQ server has successfully
> >    mapped a page into the guest at the location of the fault.
> >    Otherwise, the same fault will likely happen again.
> 
> Were there any thoughts on how to prevent this becoming an infinite loop?
> I.e. how to (a) guarantee forward progress in the guest and (b) deal with
> misbehaving IOREQ servers?

Guaranteeing forward progress is up to the IOREQ server.  If the IOREQ
server misbehaves, an infinite loop is possible, but the CPU time used
by it should be charged to the IOREQ server, so this isn't a
vulnerability.

> > 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
> >    just IOMEM.  Mappings made with `XEN_DOMCTL_memory_mapping` are
> >    guaranteed to be able to be successfully revoked with
> >    `XEN_DOMCTL_memory_mapping`, so all operations that would create
> >    extra references to the mapped memory must be forbidden.  These
> >    include, but may not be limited to:
> > 
> >    1. Granting the pages to the same or other domains.
> >    2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
> >    3. Another domain accessing the pages using the foreign memory APIs,
> >       unless it is privileged over the domain that owns the pages.
> 
> All of which may call for actually converting the memory to kind-of-MMIO,
> with a means to later convert it back.

Would this support the case where the mapping domain is not fully
privileged, and where it might be a PV guest?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


end of thread, other threads:[~2024-06-19 16:57 UTC | newest]

Thread overview: 26+ messages
2024-06-13 18:43 Design session notes: GPU acceleration in Xen Demi Marie Obenour
2024-06-14  6:38 ` Jan Beulich
2024-06-14  7:21   ` Roger Pau Monné
2024-06-14  8:12     ` Jan Beulich
2024-06-14  8:39       ` Roger Pau Monné
2024-06-17  0:38         ` Demi Marie Obenour
2024-06-17  7:46           ` Roger Pau Monné
2024-06-17 15:07             ` Demi Marie Obenour
2024-06-17 20:46             ` Marek Marczykowski-Górecki
2024-06-18  0:57               ` Demi Marie Obenour
2024-06-18  6:33                 ` Christian König
2024-06-18 14:12                   ` Demi Marie Obenour
2024-06-19  7:31                     ` Christian König
2024-06-19 16:56                       ` Alex Deucher
2024-06-18 14:43                 ` Roger Pau Monné
2024-06-18 14:56                   ` Demi Marie Obenour
2024-06-17  9:13           ` Jan Beulich
2024-06-14 16:44       ` Demi Marie Obenour
2024-06-17  9:07         ` Jan Beulich
2024-06-17 15:17           ` Demi Marie Obenour
2024-06-17 15:39             ` Jan Beulich
2024-06-17 16:02               ` Demi Marie Obenour
2024-06-14 17:55     ` Demi Marie Obenour
2024-06-14 16:35   ` Demi Marie Obenour
2024-06-17  9:05     ` Jan Beulich
2024-06-14 20:56   ` Demi Marie Obenour
