* Design session notes: GPU acceleration in Xen
@ 2024-06-13 18:43 Demi Marie Obenour
2024-06-14 6:38 ` Jan Beulich
0 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-13 18:43 UTC (permalink / raw)
To: Xen developer discussion
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Andrew Cooper,
Ray Huang
[-- Attachment #1: Type: text/plain, Size: 2580 bytes --]
GPU acceleration requires that pageable host memory be able to be mapped
into a guest. This requires changes to all of the Xen hypervisor, Linux
kernel, and userspace device model.
### Goals
- Allow any userspace pages to be mapped into a guest.
- Support deprivileged operation: this API must not be usable for privilege escalation.
- Use MMU notifiers to ensure safety with respect to use-after-free.
### Hypervisor changes
There are at least two Xen changes required:
1. Add a new flag to IOREQ that means "retry this instruction".

   An IOREQ server can set this flag after having successfully handled a
   page fault. It is expected that the IOREQ server has successfully
   mapped a page into the guest at the location of the fault.
   Otherwise, the same fault will likely happen again.

2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
   just IOMEM. Mappings made with `XEN_DOMCTL_memory_mapping` are
   guaranteed to be able to be successfully revoked with
   `XEN_DOMCTL_memory_mapping`, so all operations that would create
   extra references to the mapped memory must be forbidden. These
   include, but may not be limited to:

   1. Granting the pages to the same or other domains.
   2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
   3. Another domain accessing the pages using the foreign memory APIs,
      unless it is privileged over the domain that owns the pages.

   Open question: what if the other domain goes away? Ideally,
   unmapping would (vacuously) succeed in this case. Qubes OS doesn't
   care about domid reuse but others might.
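The "no extra references" rule in item 2 can be illustrated with a small
sketch. Everything here is hypothetical: `struct page_state`, the
`mapped_revocably` flag and the `may_*` helpers are names invented for
illustration and do not correspond to existing Xen structures or
functions. The sketch only shows the shape of the checks, under the
assumption that the hypervisor tracks per page whether it is currently
mapped through the proposed RAM variant of `XEN_DOMCTL_memory_mapping`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page bookkeeping; NOT an actual Xen structure. */
struct page_state {
    uint16_t owner_domid;      /* domain that owns the page */
    bool     mapped_revocably; /* assumed flag: mapped via the proposed RAM
                                  variant of XEN_DOMCTL_memory_mapping */
};

/* Restriction 1: granting the page (to any domain) is refused. */
static bool may_grant(const struct page_state *pg)
{
    return !pg->mapped_revocably;
}

/* Restriction 2: a second mapping into another domain is refused. */
static bool may_map_again(const struct page_state *pg)
{
    return !pg->mapped_revocably;
}

/* Restriction 3: foreign access only for a caller privileged over the owner. */
static bool may_foreign_map(const struct page_state *pg,
                            bool caller_privileged_over_owner)
{
    return !pg->mapped_revocably || caller_privileged_over_owner;
}

/* With no extra references possible, revocation has nothing to wait for. */
static void revoke(struct page_state *pg)
{
    assert(pg->mapped_revocably);
    pg->mapped_revocably = false;
}
```

The design point is that every path able to take a reference checks the
same flag, so revocation never has to chase down third parties. Whether
a single flag is enough, or whether the pages need to be converted to
something MMIO-like, is left open here.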
### Kernel changes
Linux will add support for mapping userspace memory into an emulated PCI
BAR. This requires Linux to automatically revoke access when needed.
There will be an IOREQ server that handles page faults. The discussion
assumed that this handling will happen in kernel mode, but if handling
in user mode is simpler that is also an option.
There is no async #PF in Xen (yet), so the entire vCPU will be blocked
while the fault is handled. This is not great for performance, but
correctness comes first.
There will be a new kernel ioctl to perform the mapping. A possible C
prototype (presented at design session, but not discussed there):
    struct xen_linux_register_memory {
        uint64_t pointer;
        uint64_t size;
        uint64_t gpa;
        uint32_t id;
        uint32_t guest_domid;
    };
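A minimal sketch of how a caller might populate this structure before
handing it to the (not yet existing) ioctl. The field meanings are
assumptions, since the prototype was not discussed: `pointer` is taken
to be a userspace virtual address, `size` a length in bytes, `gpa` the
guest physical address inside the emulated BAR, and `guest_domid` the
target domain. The helper `fill_register_request`, the 4 KiB page size
and the alignment requirement are likewise illustrative and not part of
the proposal. No ioctl number is shown because none has been defined.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

struct xen_linux_register_memory {
    uint64_t pointer;
    uint64_t size;
    uint64_t gpa;
    uint32_t id;
    uint32_t guest_domid;
};

/* Three 64-bit fields plus two 32-bit fields: no implicit padding. */
_Static_assert(sizeof(struct xen_linux_register_memory) == 32,
               "unexpected padding in xen_linux_register_memory");

/* Hypothetical helper: validate and fill a request; returns 0 or -EINVAL. */
static int fill_register_request(struct xen_linux_register_memory *req,
                                 const void *buf, uint64_t size,
                                 uint64_t gpa, uint32_t id,
                                 uint32_t guest_domid)
{
    uint64_t addr = (uint64_t)(uintptr_t)buf;

    assert(req != NULL);

    /* Assumed rule: address, size and GPA must all be page-aligned. */
    if (size == 0 || ((addr | size | gpa) & (SKETCH_PAGE_SIZE - 1)) != 0)
        return -EINVAL;

    /* Reject ranges that would wrap around the guest physical space. */
    if (gpa + size < gpa)
        return -EINVAL;

    memset(req, 0, sizeof(*req));
    req->pointer = addr;
    req->size = size;
    req->gpa = gpa;
    req->id = id;
    req->guest_domid = guest_domid;
    return 0;
}
```

Using fixed-width fields with no padding keeps the layout identical for
32-bit and 64-bit userspace, which is presumably why `pointer` is a
`uint64_t` rather than a `void *`.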
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
* Re: Design session notes: GPU acceleration in Xen
From: Jan Beulich @ 2024-06-14 6:38 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper

On 13.06.2024 20:43, Demi Marie Obenour wrote:
> GPU acceleration requires that pageable host memory be able to be mapped
> into a guest.

I'm sure it was explained in the session, which sadly I couldn't attend.
I've been asking Ray and Xenia the same before, but I'm afraid it still
hasn't become clear to me why this is a _requirement_. After all that's
against what we're doing elsewhere (i.e. so far it has always been
guest memory that's mapped in the host). I can appreciate that it might
be more difficult to implement, but not violating this fundamental
(kind of) rule might be worth the price (and would avoid other
complexities, of which there may be more lurking than what you enumerate
below).

> 1. Add a new flag to IOREQ that means "retry this instruction".
>
>    An IOREQ server can set this flag after having successfully handled a
>    page fault. It is expected that the IOREQ server has successfully
>    mapped a page into the guest at the location of the fault.
>    Otherwise, the same fault will likely happen again.

Were there any thoughts on how to prevent this becoming an infinite loop?
I.e. how to (a) guarantee forward progress in the guest and (b) deal with
misbehaving IOREQ servers?

> 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
>    just IOMEM. Mappings made with `XEN_DOMCTL_memory_mapping` are
>    guaranteed to be able to be successfully revoked with
>    `XEN_DOMCTL_memory_mapping`, so all operations that would create
>    extra references to the mapped memory must be forbidden. These
>    include, but may not be limited to:
>
>    1. Granting the pages to the same or other domains.
>    2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
>    3. Another domain accessing the pages using the foreign memory APIs,
>       unless it is privileged over the domain that owns the pages.

All of which may call for actually converting the memory to kind-of-MMIO,
with a means to later convert it back.

Jan
* Re: Design session notes: GPU acceleration in Xen
From: Roger Pau Monné @ 2024-06-14 7:21 UTC (permalink / raw)
To: Jan Beulich, Demi Marie Obenour
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper

On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > GPU acceleration requires that pageable host memory be able to be mapped
> > into a guest.
>
> I'm sure it was explained in the session, which sadly I couldn't attend.
> I've been asking Ray and Xenia the same before, but I'm afraid it still
> hasn't become clear to me why this is a _requirement_. After all that's
> against what we're doing elsewhere (i.e. so far it has always been
> guest memory that's mapped in the host). I can appreciate that it might
> be more difficult to implement, but not violating this fundamental
> (kind of) rule might be worth the price (and would avoid other
> complexities, of which there may be more lurking than what you enumerate
> below).

My limited understanding (please someone correct me if wrong) is that
the GPU buffer (or context I think it's also called?) is always
allocated from dom0 (the owner of the GPU). The underlying memory
addresses of such a buffer need to be mapped into the guest. The
buffer's backing memory might be GPU MMIO from the device BAR(s) or
system RAM, and such a buffer can be paged by the dom0 kernel at any
time (iow: changing the backing memory from MMIO to RAM or vice
versa). Also, the buffer must be contiguous in physical address
space.

I'm not sure it's possible to ensure that when using system RAM such
memory comes from the guest rather than the host, as it would likely
require some very intrusive hooks into the kernel logic, and
negotiation with the guest to allocate the requested amount of
memory and hand it over to dom0. If the maximum size of the buffer is
known in advance maybe dom0 can negotiate with the guest to allocate
such a region and grant dom0 access to it at driver attachment time.

One aspect where I'm lacking clarity is how the process of allocating
and assigning a GPU buffer to a guest is performed (I think this is the
key to how GPU VirtIO native contexts work?).

Another question I have: are guests expected to have a single GPU
buffer, or can they have multiple GPU buffers simultaneously
allocated?

Regards, Roger.
* Re: Design session notes: GPU acceleration in Xen
From: Jan Beulich @ 2024-06-14 8:12 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, Demi Marie Obenour

On 14.06.2024 09:21, Roger Pau Monné wrote:
> My limited understanding (please someone correct me if wrong) is that
> the GPU buffer (or context I think it's also called?) is always
> allocated from dom0 (the owner of the GPU). The underlying memory
> addresses of such a buffer need to be mapped into the guest. The
> buffer's backing memory might be GPU MMIO from the device BAR(s) or
> system RAM, and such a buffer can be paged by the dom0 kernel at any
> time (iow: changing the backing memory from MMIO to RAM or vice
> versa). Also, the buffer must be contiguous in physical address
> space.

This last one in particular would of course be a severe restriction.
Yet: There's an IOMMU involved, isn't there?

> I'm not sure it's possible to ensure that when using system RAM such
> memory comes from the guest rather than the host, as it would likely
> require some very intrusive hooks into the kernel logic, and
> negotiation with the guest to allocate the requested amount of
> memory and hand it over to dom0. If the maximum size of the buffer is
> known in advance maybe dom0 can negotiate with the guest to allocate
> such a region and grant dom0 access to it at driver attachment time.

Besides the thought of transiently converting RAM to kind-of-MMIO, this
makes me think of another possible option: Could Dom0 transfer ownership
of the RAM that wants mapping in the guest (remotely resembling
grant-transfer)? Would require the guest to have ballooned down enough
first, of course. (In both cases it would certainly need working out how
the conversion / transfer back could be made to work safely and
reasonably cleanly.)

Jan
* Re: Design session notes: GPU acceleration in Xen
From: Roger Pau Monné @ 2024-06-14 8:39 UTC (permalink / raw)
To: Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, Demi Marie Obenour

On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> On 14.06.2024 09:21, Roger Pau Monné wrote:
> > Also, the buffer must be contiguous in physical address
> > space.
>
> This last one in particular would of course be a severe restriction.
> Yet: There's an IOMMU involved, isn't there?

Yup, IIRC that's why Ray said it was much easier for them to
support VirtIO GPUs from a PVH dom0 rather than a classic PV one.

It might be easier to implement from a classic PV dom0 if there's
pv-iommu support, so that dom0 can create its own contiguous memory
buffers from the device PoV.

> > I'm not sure it's possible to ensure that when using system RAM such
> > memory comes from the guest rather than the host, as it would likely
> > require some very intrusive hooks into the kernel logic, and
> > negotiation with the guest to allocate the requested amount of
> > memory and hand it over to dom0. If the maximum size of the buffer is
> > known in advance maybe dom0 can negotiate with the guest to allocate
> > such a region and grant dom0 access to it at driver attachment time.
>
> Besides the thought of transiently converting RAM to kind-of-MMIO, this

As a note here, changing the type to MMIO would likely involve
modifying the EPT/NPT tables to propagate the new type. On a PVH dom0
this would likely involve shattering superpages in order to set the
correct memory types.

Depending on how frequent and how random those system RAM changes are,
this could also create contention on the p2m lock.

> makes me think of another possible option: Could Dom0 transfer ownership
> of the RAM that wants mapping in the guest (remotely resembling
> grant-transfer)? Would require the guest to have ballooned down enough
> first, of course. (In both cases it would certainly need working out how
> the conversion / transfer back could be made to work safely and
> reasonably cleanly.)

Maybe. The fact the guest needs to balloon down that amount of memory
seems weird to me, as from the guest PoV that mapped memory is
MMIO-like and not system RAM.

Thanks, Roger.
* Re: Design session notes: GPU acceleration in Xen
From: Demi Marie Obenour @ 2024-06-17 0:38 UTC (permalink / raw)
To: Roger Pau Monné, Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper

On Fri, Jun 14, 2024 at 10:39:37AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> > This last one in particular would of course be a severe restriction.
> > Yet: There's an IOMMU involved, isn't there?
>
> Yup, IIRC that's why Ray said it was much easier for them to
> support VirtIO GPUs from a PVH dom0 rather than a classic PV one.
>
> It might be easier to implement from a classic PV dom0 if there's
> pv-iommu support, so that dom0 can create its own contiguous memory
> buffers from the device PoV.

What makes PVH an improvement here? I thought PV dom0 uses an identity
mapping for the IOMMU, while a PVH dom0 uses an IOMMU that mirrors the
dom0 second-stage page tables. In both cases, the device physical
addresses are identical to dom0’s physical addresses.

PV is terrible for many reasons, so I’m okay with focusing on PVH dom0,
but I’d like to know why there is a difference.

> > makes me think of another possible option: Could Dom0 transfer ownership
> > of the RAM that wants mapping in the guest (remotely resembling
> > grant-transfer)? Would require the guest to have ballooned down enough
> > first, of course. (In both cases it would certainly need working out how
> > the conversion / transfer back could be made to work safely and
> > reasonably cleanly.)
>
> Maybe. The fact the guest needs to balloon down that amount of memory
> seems weird to me, as from the guest PoV that mapped memory is
> MMIO-like and not system RAM.

I don’t like it either. Furthermore, this would require changes to the
virtio-GPU driver in the guest, which I’d prefer to avoid.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: Design session notes: GPU acceleration in Xen
From: Roger Pau Monné @ 2024-06-17 7:46 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Jan Beulich, Xenia Ragiadakou, Marek Marczykowski-Górecki,
Ray Huang, Xen developer discussion, Andrew Cooper

On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> What makes PVH an improvement here? I thought PV dom0 uses an identity
> mapping for the IOMMU, while a PVH dom0 uses an IOMMU that mirrors the
> dom0 second-stage page tables.

Indeed, hence finding a physically contiguous buffer on classic PV is
way more complicated, because the IOMMU identity maps mfns, and the PV
address space can be completely scattered.

OTOH, on PVH the IOMMU page tables are the same as the second stage
translation, and hence the physical address space is way more compact
(as it would be on native).

> In both cases, the device physical
> addresses are identical to dom0’s physical addresses.

Yes, but a PV dom0 physical address space can be very scattered.

IIRC there's a hypercall to request physically contiguous memory for
PV, but you don't want to be using that every time you allocate a
buffer (not sure it would support the sizes needed by the GPU
anyway).

> PV is terrible for many reasons, so I’m okay with focusing on PVH dom0,
> but I’d like to know why there is a difference.
>
> I don’t like it either. Furthermore, this would require changes to the
> virtio-GPU driver in the guest, which I’d prefer to avoid.

IMO it would be helpful if you (or someone) could write the full
specification of how VirtIO GPU is supposed to work right now (with
the KVM model I assume?) as it would be a good starting point to
provide suggestions about how to make it work (or adapt it) on Xen.

I don't think the high level layers on top of VirtIO GPU are relevant,
but it's important to understand the protocol between the VirtIO GPU
front and back ends.

So far I've only had scattered conversations about what's needed, but
not a formal write-up of how this is supposed to work.

Thanks, Roger.
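The PV versus PVH difference described above can be illustrated with a
toy model. The array below stands for a buffer of dom0 pages, where
element i is the frame number the device would see for page i: on
classic PV that is the machine frame (the IOMMU identity-maps mfns), on
PVH it is the dom0 guest frame (the IOMMU shares the second-stage
tables). The function name and the sample frame numbers are invented
for illustration and are not taken from any real system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Length of the longest run of consecutive buffer pages whose
 * device-visible frame numbers are also consecutive.
 */
static size_t longest_contiguous_run(const uint64_t *dev_frame, size_t nr)
{
    size_t best = 0, cur = 0;

    assert(dev_frame != NULL || nr == 0);

    for (size_t i = 0; i < nr; i++) {
        if (i > 0 && dev_frame[i] == dev_frame[i - 1] + 1)
            cur++;      /* extends the current run */
        else
            cur = 1;    /* starts a new run */
        if (cur > best)
            best = cur;
    }
    return best;
}
```

With a PVH-like layout such as `{100, 101, 102, 103}` the whole buffer
is one run, while a scattered PV-like layout such as `{812, 37, 38,
4411}` yields runs of at most two pages, which is why a device needing
a large contiguous range is hard to serve from a long-running PV dom0.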
* Re: Design session notes: GPU acceleration in Xen
From: Demi Marie Obenour @ 2024-06-17 15:07 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Jan Beulich, Xenia Ragiadakou, Marek Marczykowski-Górecki,
Ray Huang, Xen developer discussion, Andrew Cooper

On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> Indeed, hence finding a physically contiguous buffer on classic PV is
> way more complicated, because the IOMMU identity maps mfns, and the PV
> address space can be completely scattered.
>
> OTOH, on PVH the IOMMU page tables are the same as the second stage
> translation, and hence the physical address space is way more compact
> (as it would be on native).

Ah, _that_ is what I missed. I didn't realize that the physical address
space of PV guests was so scattered.

> Yes, but a PV dom0 physical address space can be very scattered.
>
> IIRC there's a hypercall to request physically contiguous memory for
> PV, but you don't want to be using that every time you allocate a
> buffer (not sure it would support the sizes needed by the GPU
> anyway).

That makes sense, thanks!

> IMO it would be helpful if you (or someone) could write the full
> specification of how VirtIO GPU is supposed to work right now (with
> the KVM model I assume?) as it would be a good starting point to
> provide suggestions about how to make it work (or adapt it) on Xen.
>
> I don't think the high level layers on top of VirtIO GPU are relevant,
> but it's important to understand the protocol between the VirtIO GPU
> front and back ends.

virtio-GPU is part of the OASIS VirtIO standard [1].

[1]: https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html

> So far I've only had scattered conversations about what's needed, but
> not a formal write-up of how this is supposed to work.

My understanding is that mapping GPU buffers into guests ("blob
resources" in virtio-GPU terms) is the only part of virtio-GPU that
didn't just work. Furthermore, any solution that uses Linux's
kernel-mode GPU driver on the host will have the same requirements.
I don't consider writing a bespoke GPU driver that uses caller-allocated
buffers to be a reasonable solution that can support many GPU models.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: Design session notes: GPU acceleration in Xen
2024-06-17 7:46 ` Roger Pau Monné
2024-06-17 15:07 ` Demi Marie Obenour
@ 2024-06-17 20:46 ` Marek Marczykowski-Górecki
2024-06-18 0:57 ` Demi Marie Obenour
1 sibling, 1 reply; 26+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-06-17 20:46 UTC
To: Roger Pau Monné
Cc: Demi Marie Obenour, Jan Beulich, Xenia Ragiadakou, Ray Huang,
Xen developer discussion, Andrew Cooper

On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > In both cases, the device physical
> > addresses are identical to dom0’s physical addresses.
>
> Yes, but a PV dom0 physical address space can be very scattered.
>
> IIRC there's a hypercall to request physically contiguous memory for
> PV, but you don't want to be using that every time you allocate a
> buffer (not sure it would support the sizes needed by the GPU
> anyway).

Indeed that isn't going to fly. In older Qubes versions we had PV
sys-net with PCI passthrough for a network card. After some uptime it
was basically impossible to restart and still have enough contiguous
memory for a network driver, and there it was about _much_ smaller
buffers, like 2M or 4M. At least not without shutting down a lot more
things to free some more memory.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
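The PV versus PVH point in the exchange above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Xen code: `device_view`, `longest_contiguous_run`, and the scattered `p2m` table are hypothetical names and data made up for the illustration. It assumes, as described in the thread, that on classic PV the IOMMU identity-maps mfns, while on PVH the IOMMU mirrors the second-stage translation.

```python
def longest_contiguous_run(frames):
    """Length of the longest run of consecutive frame numbers, in order."""
    best = run = 0
    prev = None
    for frame in frames:
        run = run + 1 if prev is not None and frame == prev + 1 else 1
        best = max(best, run)
        prev = frame
    return best


def device_view(gfns, p2m, iommu_mirrors_second_stage):
    """Frame numbers the device must use to reach a buffer at `gfns`.

    PVH-like (True): the IOMMU mirrors the second-stage tables, so the
    device uses the same frame numbers as dom0.
    PV-like (False): the IOMMU identity-maps mfns, so the device sees
    the machine frames behind each dom0 frame.
    """
    if iommu_mirrors_second_stage:
        return list(gfns)
    return [p2m[gfn] for gfn in gfns]


# Hypothetical, deliberately scattered dom0 frame -> machine frame table.
# 37 is coprime with 64, so this is a permutation of 0..63.
p2m = {gfn: (gfn * 37) % 64 for gfn in range(64)}

buffer_gfns = range(8, 16)  # 8 frames, contiguous from dom0's point of view

pv_frames = device_view(buffer_gfns, p2m, iommu_mirrors_second_stage=False)
pvh_frames = device_view(buffer_gfns, p2m, iommu_mirrors_second_stage=True)

print("PV  device view:", pv_frames, "longest run:", longest_contiguous_run(pv_frames))
print("PVH device view:", pvh_frames, "longest run:", longest_contiguous_run(pvh_frames))
```

In this model the same buffer is one contiguous run from the device's point of view in the PVH-like case and fully fragmented in the PV-like case, which is the situation that made restarting a PV sys-net so hard after some uptime.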
* Re: Design session notes: GPU acceleration in Xen
2024-06-17 20:46 ` Marek Marczykowski-Górecki
@ 2024-06-18 0:57 ` Demi Marie Obenour
2024-06-18 6:33 ` Christian König
2024-06-18 14:43 ` Roger Pau Monné
0 siblings, 2 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-18 0:57 UTC
To: Marek Marczykowski-Górecki, Roger Pau Monné
Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang, Xen developer discussion,
Andrew Cooper, Direct Rendering Infrastructure development,
Christian König, Qubes OS Development Mailing List

On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> > On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > > In both cases, the device physical
> > > addresses are identical to dom0’s physical addresses.
> >
> > Yes, but a PV dom0 physical address space can be very scattered.
> >
> > IIRC there's a hypercall to request physically contiguous memory for
> > PV, but you don't want to be using that every time you allocate a
> > buffer (not sure it would support the sizes needed by the GPU
> > anyway).
>
> Indeed that isn't going to fly. In older Qubes versions we had PV
> sys-net with PCI passthrough for a network card. After some uptime it
> was basically impossible to restart and still have enough contiguous
> memory for a network driver, and there it was about _much_ smaller
> buffers, like 2M or 4M. At least not without shutting down a lot more
> things to free some more memory.

Ouch! That makes me wonder if all GPU drivers actually need physically
contiguous buffers, or if it is (as I suspect) driver-specific. CCing
Christian König who has mentioned issues in this area.

Given the recent progress on PVH dom0, is it reasonable to assume that
PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
doesn't need to worry about this problem on x86?

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: Design session notes: GPU acceleration in Xen
2024-06-18 0:57 ` Demi Marie Obenour
@ 2024-06-18 6:33 ` Christian König
2024-06-18 14:12 ` Demi Marie Obenour
2024-06-18 14:43 ` Roger Pau Monné
1 sibling, 1 reply; 26+ messages in thread
From: Christian König @ 2024-06-18 6:33 UTC
To: Demi Marie Obenour, Marek Marczykowski-Górecki, Roger Pau Monné
Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang, Xen developer discussion,
Andrew Cooper, Direct Rendering Infrastructure development,
Qubes OS Development Mailing List

On 18.06.24 at 02:57, Demi Marie Obenour wrote:
> On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> wrote:
> > On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> >> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> >>> In both cases, the device physical
> >>> addresses are identical to dom0’s physical addresses.
> >>
> >> Yes, but a PV dom0 physical address space can be very scattered.
> >>
> >> IIRC there's a hypercall to request physically contiguous memory for
> >> PV, but you don't want to be using that every time you allocate a
> >> buffer (not sure it would support the sizes needed by the GPU
> >> anyway).
>
> > Indeed that isn't going to fly. In older Qubes versions we had PV
> > sys-net with PCI passthrough for a network card. After some uptime it
> > was basically impossible to restart and still have enough contiguous
> > memory for a network driver, and there it was about _much_ smaller
> > buffers, like 2M or 4M. At least not without shutting down a lot more
> > things to free some more memory.
>
> Ouch! That makes me wonder if all GPU drivers actually need physically
> contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> Christian König who has mentioned issues in this area.

Well, GPUs don't need physically contiguous memory to function, but if
they only get 4k pages to work with, it means a quite large (up to 30%)
performance penalty.

So scattering memory like you described is probably a very bad idea if
you want any halfway decent performance.

Regards,
Christian.

>
> Given the recent progress on PVH dom0, is it reasonable to assume that
> PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
> doesn't need to worry about this problem on x86?
* Re: Design session notes: GPU acceleration in Xen
2024-06-18 6:33 ` Christian König
@ 2024-06-18 14:12 ` Demi Marie Obenour
2024-06-19 7:31 ` Christian König
0 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-18 14:12 UTC
To: Christian König, Marek Marczykowski-Górecki, Roger Pau Monné
Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang, Xen developer discussion,
Andrew Cooper, Direct Rendering Infrastructure development,
Qubes OS Development Mailing List

On Tue, Jun 18, 2024 at 08:33:38AM +0200, Christian König wrote:
> On 18.06.24 at 02:57, Demi Marie Obenour wrote:
> > On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> > wrote:
> > > On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> > >> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > >>> In both cases, the device physical
> > >>> addresses are identical to dom0’s physical addresses.
> > >>
> > >> Yes, but a PV dom0 physical address space can be very scattered.
> > >>
> > >> IIRC there's a hypercall to request physically contiguous memory for
> > >> PV, but you don't want to be using that every time you allocate a
> > >> buffer (not sure it would support the sizes needed by the GPU
> > >> anyway).
> >
> > > Indeed that isn't going to fly. In older Qubes versions we had PV
> > > sys-net with PCI passthrough for a network card. After some uptime it
> > > was basically impossible to restart and still have enough contiguous
> > > memory for a network driver, and there it was about _much_ smaller
> > > buffers, like 2M or 4M. At least not without shutting down a lot more
> > > things to free some more memory.
> >
> > Ouch! That makes me wonder if all GPU drivers actually need physically
> > contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> > Christian König who has mentioned issues in this area.
>
> Well, GPUs don't need physically contiguous memory to function, but if
> they only get 4k pages to work with, it means a quite large (up to 30%)
> performance penalty.

The status quo is "no GPU acceleration at all", so 70% of bare metal
performance would be amazing right now. However, the implementation
should not preclude eliminating this performance penalty in the future.

What size pages do GPUs need for good performance? Is it the same as
CPU huge pages? PV dom0 doesn't get huge pages at all, but PVH and HVM
guests do, and the goal is to move away from PV guests as they have lots
of unrelated problems.

> So scattering memory like you described is probably a very bad idea if
> you want any halfway decent performance.

For an initial prototype a 30% performance penalty is acceptable, but
it's good to know that memory fragmentation needs to be avoided.

> Regards,
> Christian

Thanks for the prompt response!

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: Design session notes: GPU acceleration in Xen
2024-06-18 14:12 ` Demi Marie Obenour
@ 2024-06-19 7:31 ` Christian König
2024-06-19 16:56 ` Alex Deucher
0 siblings, 1 reply; 26+ messages in thread
From: Christian König @ 2024-06-19 7:31 UTC
To: Demi Marie Obenour, Marek Marczykowski-Górecki, Roger Pau Monné,
Pelloux-prayer, Pierre-eric
Cc: Jan Beulich, Xenia Ragiadakou, Ray Huang, Xen developer discussion,
Andrew Cooper, Direct Rendering Infrastructure development,
Qubes OS Development Mailing List

On 18.06.24 at 16:12, Demi Marie Obenour wrote:
> On Tue, Jun 18, 2024 at 08:33:38AM +0200, Christian König wrote:
> > On 18.06.24 at 02:57, Demi Marie Obenour wrote:
> >> On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> >> wrote:
> >>> On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> >>>> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> >>>>> In both cases, the device physical
> >>>>> addresses are identical to dom0’s physical addresses.
> >>>>
> >>>> Yes, but a PV dom0 physical address space can be very scattered.
> >>>>
> >>>> IIRC there's a hypercall to request physically contiguous memory for
> >>>> PV, but you don't want to be using that every time you allocate a
> >>>> buffer (not sure it would support the sizes needed by the GPU
> >>>> anyway).
> >>
> >>> Indeed that isn't going to fly. In older Qubes versions we had PV
> >>> sys-net with PCI passthrough for a network card. After some uptime it
> >>> was basically impossible to restart and still have enough contiguous
> >>> memory for a network driver, and there it was about _much_ smaller
> >>> buffers, like 2M or 4M. At least not without shutting down a lot more
> >>> things to free some more memory.
> >>
> >> Ouch! That makes me wonder if all GPU drivers actually need physically
> >> contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> >> Christian König who has mentioned issues in this area.
>
> > Well, GPUs don't need physically contiguous memory to function, but if
> > they only get 4k pages to work with, it means a quite large (up to 30%)
> > performance penalty.
>
> The status quo is "no GPU acceleration at all", so 70% of bare metal
> performance would be amazing right now.

Well, AMD uses a native context approach in Xen which delivers over
90% of bare metal performance.

Pierre-Eric can tell you more, but we certainly have GPU solutions in
production with Xen which would suffer greatly if we saw the underlying
memory fragmented like this.

> However, the implementation
> should not preclude eliminating this performance penalty in the future.
>
> What size pages do GPUs need for good performance? Is it the same as
> CPU huge pages?

2MiB is usually sufficient.

Regards,
Christian.

> PV dom0 doesn't get huge pages at all, but PVH and HVM
> guests do, and the goal is to move away from PV guests as they have lots
> of unrelated problems.
>
> > So scattering memory like you described is probably a very bad idea if
> > you want any halfway decent performance.
>
> For an initial prototype a 30% performance penalty is acceptable, but
> it's good to know that memory fragmentation needs to be avoided.
>
> > Regards,
> > Christian
>
> Thanks for the prompt response!
* Re: Design session notes: GPU acceleration in Xen
2024-06-19 7:31 ` Christian König
@ 2024-06-19 16:56 ` Alex Deucher
0 siblings, 0 replies; 26+ messages in thread
From: Alex Deucher @ 2024-06-19 16:56 UTC
To: Christian König
Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, Roger Pau Monné,
Pelloux-prayer, Pierre-eric, Jan Beulich, Xenia Ragiadakou, Ray Huang,
Xen developer discussion, Andrew Cooper,
Direct Rendering Infrastructure development,
Qubes OS Development Mailing List

On Wed, Jun 19, 2024 at 12:27 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 18.06.24 at 16:12, Demi Marie Obenour wrote:
> > On Tue, Jun 18, 2024 at 08:33:38AM +0200, Christian König wrote:
> > > On 18.06.24 at 02:57, Demi Marie Obenour wrote:
> > >> On Mon, Jun 17, 2024 at 10:46:13PM +0200, Marek Marczykowski-Górecki
> > >> wrote:
> > >>> On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> > >>>> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > >>>>> In both cases, the device physical
> > >>>>> addresses are identical to dom0’s physical addresses.
> > >>>>
> > >>>> Yes, but a PV dom0 physical address space can be very scattered.
> > >>>>
> > >>>> IIRC there's a hypercall to request physically contiguous memory for
> > >>>> PV, but you don't want to be using that every time you allocate a
> > >>>> buffer (not sure it would support the sizes needed by the GPU
> > >>>> anyway).
> > >>
> > >>> Indeed that isn't going to fly. In older Qubes versions we had PV
> > >>> sys-net with PCI passthrough for a network card. After some uptime it
> > >>> was basically impossible to restart and still have enough contiguous
> > >>> memory for a network driver, and there it was about _much_ smaller
> > >>> buffers, like 2M or 4M. At least not without shutting down a lot more
> > >>> things to free some more memory.
> > >>
> > >> Ouch! That makes me wonder if all GPU drivers actually need physically
> > >> contiguous buffers, or if it is (as I suspect) driver-specific. CCing
> > >> Christian König who has mentioned issues in this area.
> >
> > > Well, GPUs don't need physically contiguous memory to function, but if
> > > they only get 4k pages to work with, it means a quite large (up to 30%)
> > > performance penalty.
> >
> > The status quo is "no GPU acceleration at all", so 70% of bare metal
> > performance would be amazing right now.
>
> Well, AMD uses a native context approach in Xen which delivers over
> 90% of bare metal performance.
>
> Pierre-Eric can tell you more, but we certainly have GPU solutions in
> production with Xen which would suffer greatly if we saw the underlying
> memory fragmented like this.
>
> > However, the implementation
> > should not preclude eliminating this performance penalty in the future.
> >
> > What size pages do GPUs need for good performance? Is it the same as
> > CPU huge pages?
>
> 2MiB is usually sufficient.

Larger pages are helpful for both system memory and VRAM, but they are
more important for VRAM.

Alex

>
> Regards,
> Christian.
>
> > PV dom0 doesn't get huge pages at all, but PVH and HVM
> > guests do, and the goal is to move away from PV guests as they have lots
> > of unrelated problems.
> >
> > > So scattering memory like you described is probably a very bad idea if
> > > you want any halfway decent performance.
> >
> > For an initial prototype a 30% performance penalty is acceptable, but
> > it's good to know that memory fragmentation needs to be avoided.
> >
> > > Regards,
> > > Christian
> >
> > Thanks for the prompt response!
>
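As a rough illustration of why page size matters in the exchange above, the arithmetic below counts how many page-sized mappings one buffer needs at 4 KiB versus 2 MiB granularity. It is only a sketch: the 256 MiB buffer size and the helper name `mappings_needed` are assumptions picked for the example, and the count says nothing by itself about the "up to 30%" figure quoted in the thread, which comes from the GPU vendors' experience rather than from this calculation.

```python
KIB = 1024
MIB = 1024 * KIB


def mappings_needed(buffer_bytes, page_bytes):
    """Number of page-sized mappings needed to cover a buffer (rounded up)."""
    return -(-buffer_bytes // page_bytes)  # ceiling division on integers


buffer_bytes = 256 * MIB  # hypothetical GPU buffer size, for illustration only

small_pages = mappings_needed(buffer_bytes, 4 * KIB)
large_pages = mappings_needed(buffer_bytes, 2 * MIB)

print("4 KiB pages:", small_pages)
print("2 MiB pages:", large_pages)
print("ratio:", small_pages // large_pages)
```

Each 2 MiB mapping replaces 512 mappings of 4 KiB, so a buffer built only from 4 KiB pages needs far more translation entries for the same amount of memory.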
* Re: Design session notes: GPU acceleration in Xen
2024-06-18 0:57 ` Demi Marie Obenour
2024-06-18 6:33 ` Christian König
@ 2024-06-18 14:43 ` Roger Pau Monné
2024-06-18 14:56 ` Demi Marie Obenour
1 sibling, 1 reply; 26+ messages in thread
From: Roger Pau Monné @ 2024-06-18 14:43 UTC
To: Demi Marie Obenour
Cc: Marek Marczykowski-Górecki, Jan Beulich, Xenia Ragiadakou, Ray Huang,
Xen developer discussion, Andrew Cooper,
Direct Rendering Infrastructure development, Christian König,
Qubes OS Development Mailing List

On Mon, Jun 17, 2024 at 08:57:14PM -0400, Demi Marie Obenour wrote:
> Given the recent progress on PVH dom0, is it reasonable to assume that
> PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
> doesn't need to worry about this problem on x86?

PVH dom0 will only be ready (whatever ready means in your use-case)
when people test and fix the issues, otherwise it would stay in the
same limbo it's currently in.

I guess the main blocker for Qubes is the lack of PCI passthrough
support in order to test it more aggressively?

Thanks, Roger.
* Re: Design session notes: GPU acceleration in Xen
2024-06-18 14:43 ` Roger Pau Monné
@ 2024-06-18 14:56 ` Demi Marie Obenour
0 siblings, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-18 14:56 UTC
To: Roger Pau Monné
Cc: Marek Marczykowski-Górecki, Jan Beulich, Xenia Ragiadakou, Ray Huang,
Xen developer discussion, Andrew Cooper,
Direct Rendering Infrastructure development, Christian König,
Qubes OS Development Mailing List

On Tue, Jun 18, 2024 at 04:43:50PM +0200, Roger Pau Monné wrote:
> On Mon, Jun 17, 2024 at 08:57:14PM -0400, Demi Marie Obenour wrote:
> > Given the recent progress on PVH dom0, is it reasonable to assume that
> > PVH dom0 will be ready in time for R4.3, and that therefore Qubes OS
> > doesn't need to worry about this problem on x86?
>
> PVH dom0 will only be ready (whatever ready means in your use-case)
> when people test and fix the issues, otherwise it would stay in the
> same limbo it's currently in.
>
> I guess the main blocker for Qubes is the lack of PCI passthrough
> support in order to test it more aggressively?

I suspect so, though Marek would need to confirm.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: Design session notes: GPU acceleration in Xen 2024-06-17 0:38 ` Demi Marie Obenour 2024-06-17 7:46 ` Roger Pau Monné @ 2024-06-17 9:13 ` Jan Beulich 1 sibling, 0 replies; 26+ messages in thread From: Jan Beulich @ 2024-06-17 9:13 UTC (permalink / raw) To: Demi Marie Obenour Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang, Xen developer discussion, Andrew Cooper, Roger Pau Monné On 17.06.2024 02:38, Demi Marie Obenour wrote: > On Fri, Jun 14, 2024 at 10:39:37AM +0200, Roger Pau Monné wrote: >> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote: >>> On 14.06.2024 09:21, Roger Pau Monné wrote: >>>> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote: >>>>> On 13.06.2024 20:43, Demi Marie Obenour wrote: >>>>>> GPU acceleration requires that pageable host memory be able to be mapped >>>>>> into a guest. >>>>> >>>>> I'm sure it was explained in the session, which sadly I couldn't attend. >>>>> I've been asking Ray and Xenia the same before, but I'm afraid it still >>>>> hasn't become clear to me why this is a _requirement_. After all that's >>>>> against what we're doing elsewhere (i.e. so far it has always been >>>>> guest memory that's mapped in the host). I can appreciate that it might >>>>> be more difficult to implement, but avoiding to violate this fundamental >>>>> (kind of) rule might be worth the price (and would avoid other >>>>> complexities, of which there may be lurking more than what you enumerate >>>>> below). >>>> >>>> My limited understanding (please someone correct me if wrong) is that >>>> the GPU buffer (or context I think it's also called?) is always >>>> allocated from dom0 (the owner of the GPU). The underling memory >>>> addresses of such buffer needs to be mapped into the guest. The >>>> buffer backing memory might be GPU MMIO from the device BAR(s) or >>>> system RAM, and such buffer can be paged by the dom0 kernel at any >>>> time (iow: changing the backing memory from MMIO to RAM or vice >>>> versa). 
Also, the buffer must be contiguous in physical address >>>> space. >>> >>> This last one in particular would of course be a severe restriction. >>> Yet: There's an IOMMU involved, isn't there? >> >> Yup, IIRC that's why Ray said it was much more easier for them to >> support VirtIO GPUs from a PVH dom0 rather than classic PV one. >> >> It might be easier to implement from a classic PV dom0 if there's >> pv-iommu support, so that dom0 can create it's own contiguous memory >> buffers from the device PoV. > > What makes PVH an improvement here? I thought PV dom0 uses an identity > mapping for the IOMMU, True, but see how Roger mentioned PV IOMMU (which would allow a domain to move away from this identity mapping). Jan > while a PVH dom0 uses an IOMMU that mirrors the > dom0 second-stage page tables. In both cases, the device physical > addresses are identical to dom0’s physical addresses. > > PV is terrible for many reasons, so I’m okay with focusing on PVH dom0, > but I’d like to know why there is a difference. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Design session notes: GPU acceleration in Xen 2024-06-14 8:12 ` Jan Beulich 2024-06-14 8:39 ` Roger Pau Monné @ 2024-06-14 16:44 ` Demi Marie Obenour 2024-06-17 9:07 ` Jan Beulich 1 sibling, 1 reply; 26+ messages in thread From: Demi Marie Obenour @ 2024-06-14 16:44 UTC (permalink / raw) To: Jan Beulich, Roger Pau Monné Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang, Xen developer discussion, Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 3113 bytes --] On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote: > On 14.06.2024 09:21, Roger Pau Monné wrote: > > On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote: > >> On 13.06.2024 20:43, Demi Marie Obenour wrote: > >>> GPU acceleration requires that pageable host memory be able to be mapped > >>> into a guest. > >> > >> I'm sure it was explained in the session, which sadly I couldn't attend. > >> I've been asking Ray and Xenia the same before, but I'm afraid it still > >> hasn't become clear to me why this is a _requirement_. After all that's > >> against what we're doing elsewhere (i.e. so far it has always been > >> guest memory that's mapped in the host). I can appreciate that it might > >> be more difficult to implement, but avoiding to violate this fundamental > >> (kind of) rule might be worth the price (and would avoid other > >> complexities, of which there may be lurking more than what you enumerate > >> below). > > > > My limited understanding (please someone correct me if wrong) is that > > the GPU buffer (or context I think it's also called?) is always > > allocated from dom0 (the owner of the GPU). The underling memory > > addresses of such buffer needs to be mapped into the guest. The > > buffer backing memory might be GPU MMIO from the device BAR(s) or > > system RAM, and such buffer can be paged by the dom0 kernel at any > > time (iow: changing the backing memory from MMIO to RAM or vice > > versa). 
Also, the buffer must be contiguous in physical address > > space. > > This last one in particular would of course be a severe restriction. > Yet: There's an IOMMU involved, isn't there? On x86 there is. On Arm there might or might not be. There are non-embedded systems (such as Apple silicon) where the GPU is not behind an IOMMU, for performance reasons IIUC. > > I'm not sure it's possible to ensure that when using system RAM such > > memory comes from the guest rather than the host, as it would likely > > require some very intrusive hooks into the kernel logic, and > > negotiation with the guest to allocate the requested amount of > > memory and hand it over to dom0. If the maximum size of the buffer is > > known in advance maybe dom0 can negotiate with the guest to allocate > > such a region and grant it access to dom0 at driver attachment time. > > Besides the thought of transiently converting RAM to kind-of-MMIO, this > makes me think of another possible option: Could Dom0 transfer ownership > of the RAM that wants mapping in the guest (remotely resembling > grant-transfer)? Would require the guest to have ballooned down enough > first, of course. (In both cases it would certainly need working out how > the conversion / transfer back could be made work safely and reasonably > cleanly.) > > Jan The kernel driver needs to be able to reclaim the memory at any time. My understanding is that this is used to migrate memory between VRAM and system RAM. It might also be used for other purposes. -- Sincerely, Demi Marie Obenour (she/her/hers) Invisible Things Lab [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Design session notes: GPU acceleration in Xen 2024-06-14 16:44 ` Demi Marie Obenour @ 2024-06-17 9:07 ` Jan Beulich 2024-06-17 15:17 ` Demi Marie Obenour 0 siblings, 1 reply; 26+ messages in thread From: Jan Beulich @ 2024-06-17 9:07 UTC (permalink / raw) To: Demi Marie Obenour Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang, Xen developer discussion, Andrew Cooper, Roger Pau Monné On 14.06.2024 18:44, Demi Marie Obenour wrote: > On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote: >> On 14.06.2024 09:21, Roger Pau Monné wrote: >>> I'm not sure it's possible to ensure that when using system RAM such >>> memory comes from the guest rather than the host, as it would likely >>> require some very intrusive hooks into the kernel logic, and >>> negotiation with the guest to allocate the requested amount of >>> memory and hand it over to dom0. If the maximum size of the buffer is >>> known in advance maybe dom0 can negotiate with the guest to allocate >>> such a region and grant it access to dom0 at driver attachment time. >> >> Besides the thought of transiently converting RAM to kind-of-MMIO, this >> makes me think of another possible option: Could Dom0 transfer ownership >> of the RAM that wants mapping in the guest (remotely resembling >> grant-transfer)? Would require the guest to have ballooned down enough >> first, of course. (In both cases it would certainly need working out how >> the conversion / transfer back could be made work safely and reasonably >> cleanly.) > > The kernel driver needs to be able to reclaim the memory at any time. > My understanding is that this is used to migrate memory between VRAM and > system RAM. It might also be used for other purposes. Except: How would the kernel driver reclaim the memory when it's mapped by a DomU? Jan ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Design session notes: GPU acceleration in Xen
2024-06-17 9:07 ` Jan Beulich
@ 2024-06-17 15:17 ` Demi Marie Obenour
2024-06-17 15:39 ` Jan Beulich
0 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-17 15:17 UTC (permalink / raw)
To: Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, Roger Pau Monné
[-- Attachment #1: Type: text/plain, Size: 2130 bytes --]
On Mon, Jun 17, 2024 at 11:07:54AM +0200, Jan Beulich wrote:
> On 14.06.2024 18:44, Demi Marie Obenour wrote:
> > On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> >> On 14.06.2024 09:21, Roger Pau Monné wrote:
> >>> I'm not sure it's possible to ensure that when using system RAM such
> >>> memory comes from the guest rather than the host, as it would likely
> >>> require some very intrusive hooks into the kernel logic, and
> >>> negotiation with the guest to allocate the requested amount of
> >>> memory and hand it over to dom0. If the maximum size of the buffer is
> >>> known in advance maybe dom0 can negotiate with the guest to allocate
> >>> such a region and grant it access to dom0 at driver attachment time.
> >>
> >> Besides the thought of transiently converting RAM to kind-of-MMIO, this
> >> makes me think of another possible option: Could Dom0 transfer ownership
> >> of the RAM that wants mapping in the guest (remotely resembling
> >> grant-transfer)? Would require the guest to have ballooned down enough
> >> first, of course. (In both cases it would certainly need working out how
> >> the conversion / transfer back could be made work safely and reasonably
> >> cleanly.)
> >
> > The kernel driver needs to be able to reclaim the memory at any time.
> > My understanding is that this is used to migrate memory between VRAM and
> > system RAM. It might also be used for other purposes.
>
> Except: How would the kernel driver reclaim the memory when it's mapped
> by a DomU?

The Xen driver in dom0 will register for MMU notifier callbacks. When
the kernel driver reclaims the memory, the Xen driver will be notified,
and it will issue a hypercall that tells Xen to remove the memory from
the DomU's address space. Subsequent accesses to the pages will trigger
a stage 2 translation fault that is handled by an IOREQ server.

For I/O memory, this is already possible via XEN_DOMCTL_memory_mapping.
The proposal in this thread is to make this possible for system RAM as
well.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
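The revoke-then-fault sequence described in the message above can be modelled in a few lines of plain C. This is only a toy sketch under stated assumptions: every name in it (`model_p2m`, `model_invalidate_range`, `model_handle_fault`, and so on) is hypothetical, and none of it is the Linux MMU notifier API, a Xen hypercall, or the IOREQ interface. It shows just the ordering: the notifier callback removes the mapping, the next guest access faults, and the fault handler maps a page again before the access is retried.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGES 8u /* size of the toy guest address range, in pages */

/* Toy stage 2 table: one slot per guest page (hypothetical model). */
struct model_p2m {
    bool present[MODEL_PAGES];       /* is the guest page currently mapped? */
    uint64_t host_page[MODEL_PAGES]; /* which host page backs it */
    unsigned int faults;             /* stage 2 faults taken so far */
};

/* Stand-in for the hypercall that maps a host page into the guest. */
static int model_map(struct model_p2m *p2m, size_t gfn, uint64_t host_page)
{
    if (gfn >= MODEL_PAGES)
        return -1;
    p2m->present[gfn] = true;
    p2m->host_page[gfn] = host_page;
    return 0;
}

/* Stand-in for the MMU notifier callback: revoke guest access to
 * [start, end) before the GPU driver reclaims the backing memory. */
static void model_invalidate_range(struct model_p2m *p2m,
                                   size_t start, size_t end)
{
    assert(start <= end);
    for (size_t gfn = start; gfn < end && gfn < MODEL_PAGES; gfn++)
        p2m->present[gfn] = false;
}

/* Stand-in for the IOREQ server's fault handler: map a page again and
 * report whether the faulting access may be retried. */
static bool model_handle_fault(struct model_p2m *p2m, size_t gfn,
                               uint64_t new_host_page)
{
    p2m->faults++;
    return model_map(p2m, gfn, new_host_page) == 0;
}

/* Guest access: faults if the page is absent, then retries once. */
static int model_guest_access(struct model_p2m *p2m, size_t gfn,
                              uint64_t refill_page, uint64_t *out)
{
    if (gfn >= MODEL_PAGES)
        return -1;
    if (!p2m->present[gfn] && !model_handle_fault(p2m, gfn, refill_page))
        return -1;
    *out = p2m->host_page[gfn];
    return 0;
}
```

In this model, mapping guest page 2, calling `model_invalidate_range(&p2m, 0, 4)`, and then accessing page 2 again takes exactly one fault and returns the refill page rather than the original one.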
* Re: Design session notes: GPU acceleration in Xen
2024-06-17 15:17 ` Demi Marie Obenour
@ 2024-06-17 15:39 ` Jan Beulich
2024-06-17 16:02 ` Demi Marie Obenour
0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2024-06-17 15:39 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, Roger Pau Monné
On 17.06.2024 17:17, Demi Marie Obenour wrote:
> On Mon, Jun 17, 2024 at 11:07:54AM +0200, Jan Beulich wrote:
>> On 14.06.2024 18:44, Demi Marie Obenour wrote:
>>> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
>>>> On 14.06.2024 09:21, Roger Pau Monné wrote:
>>>>> I'm not sure it's possible to ensure that when using system RAM such
>>>>> memory comes from the guest rather than the host, as it would likely
>>>>> require some very intrusive hooks into the kernel logic, and
>>>>> negotiation with the guest to allocate the requested amount of
>>>>> memory and hand it over to dom0. If the maximum size of the buffer is
>>>>> known in advance maybe dom0 can negotiate with the guest to allocate
>>>>> such a region and grant it access to dom0 at driver attachment time.
>>>>
>>>> Besides the thought of transiently converting RAM to kind-of-MMIO, this
>>>> makes me think of another possible option: Could Dom0 transfer ownership
>>>> of the RAM that wants mapping in the guest (remotely resembling
>>>> grant-transfer)? Would require the guest to have ballooned down enough
>>>> first, of course. (In both cases it would certainly need working out how
>>>> the conversion / transfer back could be made work safely and reasonably
>>>> cleanly.)
>>>
>>> The kernel driver needs to be able to reclaim the memory at any time.
>>> My understanding is that this is used to migrate memory between VRAM and
>>> system RAM. It might also be used for other purposes.
>>
>> Except: How would the kernel driver reclaim the memory when it's mapped
>> by a DomU?
>
> The Xen driver in dom0 will register for MMU notifier callbacks. When
> the kernel driver reclaims the memory, the Xen driver will be notified,
> and it will issue a hypercall that tells Xen to remove the memory from
> the DomU's address space. Subsequent accesses to the pages will trigger
> a stage 2 translation fault that is handled by an IOREQ server.

And such an ioreq server, which I assume isn't going to run in the Dom0
kernel, will then also need keeping up-to-date on holes in the (virtual)
BAR. Oh well ...

Jan
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Design session notes: GPU acceleration in Xen
2024-06-17 15:39 ` Jan Beulich
@ 2024-06-17 16:02 ` Demi Marie Obenour
0 siblings, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-17 16:02 UTC (permalink / raw)
To: Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, Roger Pau Monné
[-- Attachment #1: Type: text/plain, Size: 2567 bytes --]
On Mon, Jun 17, 2024 at 05:39:23PM +0200, Jan Beulich wrote:
> On 17.06.2024 17:17, Demi Marie Obenour wrote:
> > On Mon, Jun 17, 2024 at 11:07:54AM +0200, Jan Beulich wrote:
> >> On 14.06.2024 18:44, Demi Marie Obenour wrote:
> >>> On Fri, Jun 14, 2024 at 10:12:40AM +0200, Jan Beulich wrote:
> >>>> On 14.06.2024 09:21, Roger Pau Monné wrote:
> >>>>> I'm not sure it's possible to ensure that when using system RAM such
> >>>>> memory comes from the guest rather than the host, as it would likely
> >>>>> require some very intrusive hooks into the kernel logic, and
> >>>>> negotiation with the guest to allocate the requested amount of
> >>>>> memory and hand it over to dom0. If the maximum size of the buffer is
> >>>>> known in advance maybe dom0 can negotiate with the guest to allocate
> >>>>> such a region and grant it access to dom0 at driver attachment time.
> >>>>
> >>>> Besides the thought of transiently converting RAM to kind-of-MMIO, this
> >>>> makes me think of another possible option: Could Dom0 transfer ownership
> >>>> of the RAM that wants mapping in the guest (remotely resembling
> >>>> grant-transfer)? Would require the guest to have ballooned down enough
> >>>> first, of course. (In both cases it would certainly need working out how
> >>>> the conversion / transfer back could be made work safely and reasonably
> >>>> cleanly.)
> >>>
> >>> The kernel driver needs to be able to reclaim the memory at any time.
> >>> My understanding is that this is used to migrate memory between VRAM and
> >>> system RAM. It might also be used for other purposes.
> >>
> >> Except: How would the kernel driver reclaim the memory when it's mapped
> >> by a DomU?
> >
> > The Xen driver in dom0 will register for MMU notifier callbacks. When
> > the kernel driver reclaims the memory, the Xen driver will be notified,
> > and it will issue a hypercall that tells Xen to remove the memory from
> > the DomU's address space. Subsequent accesses to the pages will trigger
> > a stage 2 translation fault that is handled by an IOREQ server.
>
> And such an ioreq server, which I assume isn't going to run in the Dom0
> kernel, will then also need keeping up-to-date on holes in the (virtual)
> BAR. Oh well ...

My initial plan was that it _would_ run in the dom0 kernel, because this
results in a cleaner userspace API. Ultimately I think it is best to go
with whichever approach keeps the kernel code simpler, but I'm not sure.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Design session notes: GPU acceleration in Xen
2024-06-14 7:21 ` Roger Pau Monné
2024-06-14 8:12 ` Jan Beulich
@ 2024-06-14 17:55 ` Demi Marie Obenour
1 sibling, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 17:55 UTC (permalink / raw)
To: Roger Pau Monné, Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper
[-- Attachment #1: Type: text/plain, Size: 3341 bytes --]
On Fri, Jun 14, 2024 at 09:21:56AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> > On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > > GPU acceleration requires that pageable host memory be able to be mapped
> > > into a guest.
> >
> > I'm sure it was explained in the session, which sadly I couldn't attend.
> > I've been asking Ray and Xenia the same before, but I'm afraid it still
> > hasn't become clear to me why this is a _requirement_. After all that's
> > against what we're doing elsewhere (i.e. so far it has always been
> > guest memory that's mapped in the host). I can appreciate that it might
> > be more difficult to implement, but avoiding to violate this fundamental
> > (kind of) rule might be worth the price (and would avoid other
> > complexities, of which there may be lurking more than what you enumerate
> > below).
>
> My limited understanding (please someone correct me if wrong) is that
> the GPU buffer (or context I think it's also called?) is always
> allocated from dom0 (the owner of the GPU).

A GPU context is a GPU address space. It's the GPU equivalent of a CPU
process. I don't believe that the same context can be used by more than
one userspace process (though I could be wrong), but the same userspace
process can create and use as many contexts as it wants.

> The underlying memory
> addresses of such buffer needs to be mapped into the guest. The
> buffer backing memory might be GPU MMIO from the device BAR(s) or
> system RAM, and such buffer can be paged by the dom0 kernel at any
> time (iow: changing the backing memory from MMIO to RAM or vice
> versa). Also, the buffer must be contiguous in physical address
> space.
>
> I'm not sure it's possible to ensure that when using system RAM such
> memory comes from the guest rather than the host, as it would likely
> require some very intrusive hooks into the kernel logic, and
> negotiation with the guest to allocate the requested amount of
> memory and hand it over to dom0. If the maximum size of the buffer is
> known in advance maybe dom0 can negotiate with the guest to allocate
> such a region and grant it access to dom0 at driver attachment time.

I don't think there is a useful maximum size known. There may be a
limit, but it would be around 4GiB or more, which is far too high to
reserve physical memory for up front.

> One aspect that I'm lacking clarity is better understanding of how the
> process of allocating and assigning a GPU buffer to a guest is
> performed (I think this is the key to how GPU VirtIO native contexts
> work?).

The buffer is allocated by the GPU driver in response to an ioctl() made
by the userspace server process. If the buffer needs to be accessed by
the guest CPU (not all do), it is mapped into part of an emulated PCI
BAR for access by the guest. This mailing list thread is about making
that possible.

> Another question I have, are guest expected to have a single GPU
> buffer, or they can have multiple GPU buffers simultaneously
> allocated?

I believe there is only one emulated BAR, but this is very large (GiBs)
and sparsely populated. There can be many GPU buffers mapped into the
BAR.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
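One way to picture a single large, sparsely populated BAR holding many buffers is a sorted table of regions, with everything between regions being a hole. The sketch below is illustrative only: `bar_region` and `bar_lookup` are hypothetical names, not part of any Xen, QEMU, or virtio-GPU interface, and the layout it assumes (non-overlapping regions sorted by offset) is an assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One populated range inside the emulated BAR (hypothetical layout). */
struct bar_region {
    uint64_t offset;    /* start of the range, relative to the BAR base */
    uint64_t size;      /* length of the range in bytes */
    uint32_t buffer_id; /* which GPU buffer is mapped here */
};

/* Binary search over regions sorted by offset and not overlapping.
 * Returns the region containing bar_offset, or NULL for a hole. */
static const struct bar_region *bar_lookup(const struct bar_region *regions,
                                           size_t count, uint64_t bar_offset)
{
    size_t lo = 0, hi = count;

    assert(regions != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct bar_region *r = &regions[mid];

        if (bar_offset < r->offset)
            hi = mid;                 /* target lies before this region */
        else if (bar_offset - r->offset >= r->size)
            lo = mid + 1;             /* target lies after this region */
        else
            return r;                 /* target lies inside this region */
    }
    return NULL;
}
```

A fault handler built along these lines would treat a `NULL` result as an access to a hole, and a non-`NULL` result as an access to a buffer whose backing pages may need to be mapped again.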
* Re: Design session notes: GPU acceleration in Xen
2024-06-14 6:38 ` Jan Beulich
2024-06-14 7:21 ` Roger Pau Monné
@ 2024-06-14 16:35 ` Demi Marie Obenour
2024-06-17 9:05 ` Jan Beulich
2024-06-14 20:56 ` Demi Marie Obenour
2 siblings, 1 reply; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 16:35 UTC (permalink / raw)
To: Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, dri-devel
[-- Attachment #1: Type: text/plain, Size: 4732 bytes --]
On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > GPU acceleration requires that pageable host memory be able to be mapped
> > into a guest.
>
> I'm sure it was explained in the session, which sadly I couldn't attend.
> I've been asking Ray and Xenia the same before, but I'm afraid it still
> hasn't become clear to me why this is a _requirement_. After all that's
> against what we're doing elsewhere (i.e. so far it has always been
> guest memory that's mapped in the host). I can appreciate that it might
> be more difficult to implement, but avoiding to violate this fundamental
> (kind of) rule might be worth the price (and would avoid other
> complexities, of which there may be lurking more than what you enumerate
> below).

My understanding is:

- Discrete GPUs require the memory to be VRAM, rather than system RAM.
- Various APIs require dmabufs. Xen's support for dmabufs doesn't work
  with PV dom0.
- The existing virtio-GPU protocol (which is not Xen-specific and so
  gets more testing and has broader support than anything that _is_
  Xen-specific) requires backend allocation for native contexts.
- There might be other issues (caching? memory management?) involved.

I'm CCing dri-devel in hopes of getting a better response.

> > This requires changes to all of the Xen hypervisor, Linux
> > kernel, and userspace device model.
> >
> > ### Goals
> >
> > - Allow any userspace pages to be mapped into a guest.
> > - Support deprivileged operation: this API must not be usable for privilege escalation.
> > - Use MMU notifiers to ensure safety with respect to use-after-free.
> >
> > ### Hypervisor changes
> >
> > There are at least two Xen changes required:
> >
> > 1. Add a new flag to IOREQ that means "retry this instruction".
> >
> > An IOREQ server can set this flag after having successfully handled a
> > page fault. It is expected that the IOREQ server has successfully
> > mapped a page into the guest at the location of the fault.
> > Otherwise, the same fault will likely happen again.
>
> Were there any thoughts on how to prevent this becoming an infinite loop?
> I.e. how to (a) guarantee forward progress in the guest and (b) deal with
> misbehaving IOREQ servers?

Guaranteeing forward progress is up to the IOREQ server. If the IOREQ
server misbehaves, an infinite loop is possible, but the CPU time used
by it should be charged to the IOREQ server, so this isn't a
vulnerability.

> > 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
> > just IOMEM. Mappings made with `XEN_DOMCTL_memory_mapping` are
> > guaranteed to be able to be successfully revoked with
> > `XEN_DOMCTL_memory_mapping`, so all operations that would create
> > extra references to the mapped memory must be forbidden. These
> > include, but may not be limited to:
> >
> > 1. Granting the pages to the same or other domains.
> > 2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
> > 3. Another domain accessing the pages using the foreign memory APIs,
> > unless it is privileged over the domain that owns the pages.
>
> All of which may call for actually converting the memory to kind-of-MMIO,
> with a means to later convert it back.

Would this support the case where the mapping domain is not fully
privileged, and where it might be a PV guest?

> Jan
>
> > Open question: what if the other domain goes away? Ideally,
> > unmapping would (vacuously) succeed in this case. Qubes OS doesn't
> > care about domid reuse but others might.
> >
> > ### Kernel changes
> >
> > Linux will add support for mapping userspace memory into an emulated PCI
> > BAR. This requires Linux to automatically revoke access when needed.
> >
> > There will be an IOREQ server that handles page faults. The discussion
> > assumed that this handling will happen in kernel mode, but if handling
> > in user mode is simpler that is also an option.
> >
> > There is no async #PF in Xen (yet), so the entire vCPU will be blocked
> > while the fault is handled. This is not great for performance, but
> > correctness comes first.
> >
> > There will be a new kernel ioctl to perform the mapping. A possible C
> > prototype (presented at design session, but not discussed there):
> >
> > struct xen_linux_register_memory {
> >     uint64_t pointer;
> >     uint64_t size;
> >     uint64_t gpa;
> >     uint32_t id;
> >     uint32_t guest_domid;
> > };
>
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
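As a rough illustration of what a kernel might check before acting on such a request, the sketch below repeats the quoted `struct xen_linux_register_memory` and adds a validator. Everything other than the struct is an assumption made for the example: the 4 KiB page size, the helper names `sketch_range_ok` and `sketch_request_ok`, and the specific rules (non-zero page-multiple size, page-aligned `pointer` and `gpa`, no address wraparound) were not part of the design session notes. The ioctl number and the meaning of `id` were not given either, so they are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096) /* assumed page size */

/* The prototype quoted in the thread, reproduced unchanged. */
struct xen_linux_register_memory {
    uint64_t pointer;
    uint64_t size;
    uint64_t gpa;
    uint32_t id;
    uint32_t guest_domid;
};

/* Three 64-bit and two 32-bit fields pack into 32 bytes with no padding. */
static_assert(sizeof(struct xen_linux_register_memory) == 32,
              "unexpected padding in xen_linux_register_memory");

/* Hypothetical check: base is page-aligned and base + size cannot wrap. */
static bool sketch_range_ok(uint64_t base, uint64_t size)
{
    if (base % SKETCH_PAGE_SIZE != 0)
        return false;
    return size <= UINT64_MAX - base;
}

/* Hypothetical check applied to a whole request before mapping anything. */
static bool sketch_request_ok(const struct xen_linux_register_memory *req)
{
    if (req->size == 0 || req->size % SKETCH_PAGE_SIZE != 0)
        return false;
    return sketch_range_ok(req->pointer, req->size) &&
           sketch_range_ok(req->gpa, req->size);
}
```

Using fixed-width fields throughout, as the quoted prototype does, gives the structure the same layout for 32-bit and 64-bit callers, which is what the `static_assert` above checks.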
* Re: Design session notes: GPU acceleration in Xen
2024-06-14 16:35 ` Demi Marie Obenour
@ 2024-06-17 9:05 ` Jan Beulich
0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2024-06-17 9:05 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, dri-devel, Daniel Vetter,
David Airlie
On 14.06.2024 18:35, Demi Marie Obenour wrote:
> On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
>> On 13.06.2024 20:43, Demi Marie Obenour wrote:
>>> 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
>>> just IOMEM. Mappings made with `XEN_DOMCTL_memory_mapping` are
>>> guaranteed to be able to be successfully revoked with
>>> `XEN_DOMCTL_memory_mapping`, so all operations that would create
>>> extra references to the mapped memory must be forbidden. These
>>> include, but may not be limited to:
>>>
>>> 1. Granting the pages to the same or other domains.
>>> 2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
>>> 3. Another domain accessing the pages using the foreign memory APIs,
>>> unless it is privileged over the domain that owns the pages.
>>
>> All of which may call for actually converting the memory to kind-of-MMIO,
>> with a means to later convert it back.
>
> Would this support the case where the mapping domain is not fully
> privileged, and where it might be a PV guest?

I suppose that should be a goal.

Jan
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Design session notes: GPU acceleration in Xen
2024-06-14 6:38 ` Jan Beulich
2024-06-14 7:21 ` Roger Pau Monné
2024-06-14 16:35 ` Demi Marie Obenour
@ 2024-06-14 20:56 ` Demi Marie Obenour
2 siblings, 0 replies; 26+ messages in thread
From: Demi Marie Obenour @ 2024-06-14 20:56 UTC (permalink / raw)
To: Jan Beulich
Cc: Xenia Ragiadakou, Marek Marczykowski-Górecki, Ray Huang,
Xen developer discussion, Andrew Cooper, Roger Pau Monné,
Direct Rendering Infrastructure development, Daniel Vetter,
David Airlie, Rob Clark
[-- Attachment #1: Type: text/plain, Size: 4007 bytes --]
On Fri, Jun 14, 2024 at 08:38:51AM +0200, Jan Beulich wrote:
> On 13.06.2024 20:43, Demi Marie Obenour wrote:
> > GPU acceleration requires that pageable host memory be able to be mapped
> > into a guest.
>
> I'm sure it was explained in the session, which sadly I couldn't attend.
> I've been asking Ray and Xenia the same before, but I'm afraid it still
> hasn't become clear to me why this is a _requirement_. After all that's
> against what we're doing elsewhere (i.e. so far it has always been
> guest memory that's mapped in the host). I can appreciate that it might
> be more difficult to implement, but avoiding to violate this fundamental
> (kind of) rule might be worth the price (and would avoid other
> complexities, of which there may be lurking more than what you enumerate
> below).

The GPU driver knows how to allocate buffers that are usable by the GPU.
On a discrete GPU, these buffers will generally be in VRAM, rather than
in system RAM, because access to system RAM requires going through the
PCI bus (slow). However, VRAM is a limited resource, so the driver will
migrate pages between VRAM and system RAM as needed. During the
migration, a guest that tries to access the pages must block until the
migration is complete.

Some GPU drivers support accessing externally provided memory. This is
called "userptr", and is supported by i915 and amdgpu. However, it
appears that some other drivers (such as MSM) do not support it, and
since GPUs with VRAM need to be supported anyway, Xen still needs to
support GPU driver-allocated memory.

I also CCd dri-devel@lists.freedesktop.org and the general GPU driver
maintainers in Linux in case they can give a better answer, as well as
Rob Clark who invented native contexts.

> > This requires changes to all of the Xen hypervisor, Linux
> > kernel, and userspace device model.
> >
> > ### Goals
> >
> > - Allow any userspace pages to be mapped into a guest.
> > - Support deprivileged operation: this API must not be usable for privilege escalation.
> > - Use MMU notifiers to ensure safety with respect to use-after-free.
> >
> > ### Hypervisor changes
> >
> > There are at least two Xen changes required:
> >
> > 1. Add a new flag to IOREQ that means "retry this instruction".
> >
> > An IOREQ server can set this flag after having successfully handled a
> > page fault. It is expected that the IOREQ server has successfully
> > mapped a page into the guest at the location of the fault.
> > Otherwise, the same fault will likely happen again.
>
> Were there any thoughts on how to prevent this becoming an infinite loop?
> I.e. how to (a) guarantee forward progress in the guest and (b) deal with
> misbehaving IOREQ servers?

Guaranteeing forward progress is up to the IOREQ server. If the IOREQ
server misbehaves, an infinite loop is possible, but the CPU time used
by it should be charged to the IOREQ server, so this isn't a
vulnerability.

> > 2. Add support for `XEN_DOMCTL_memory_mapping` to use system RAM, not
> > just IOMEM. Mappings made with `XEN_DOMCTL_memory_mapping` are
> > guaranteed to be able to be successfully revoked with
> > `XEN_DOMCTL_memory_mapping`, so all operations that would create
> > extra references to the mapped memory must be forbidden. These
> > include, but may not be limited to:
> >
> > 1. Granting the pages to the same or other domains.
> > 2. Mapping into another domain using `XEN_DOMCTL_memory_mapping`.
> > 3. Another domain accessing the pages using the foreign memory APIs,
> > unless it is privileged over the domain that owns the pages.
>
> All of which may call for actually converting the memory to kind-of-MMIO,
> with a means to later convert it back.

Would this support the case where the mapping domain is not fully
privileged, and where it might be a PV guest?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
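The three forbidden operations quoted above (granting, mapping into another domain, and foreign mapping by a domain that is not privileged over the owner) amount to a small policy check. The sketch below is a toy model of that rule only. The names `model_page`, `page_op`, and `model_op_allowed` are hypothetical, the flag `revocable_mapped` is an invented stand-in for "this page is currently mapped via the extended `XEN_DOMCTL_memory_mapping`", and nothing here reflects how Xen tracks page references internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Operations that would create an extra reference to a page. */
enum page_op {
    OP_GRANT,            /* grant the page to the same or another domain */
    OP_MAP_OTHER_DOMAIN, /* map the page into another domain */
    OP_FOREIGN_MAP,      /* access via the foreign memory APIs */
};

/* Toy page state (hypothetical). */
struct model_page {
    bool revocable_mapped; /* mapped with the must-be-revocable guarantee */
    uint16_t owner_domid;  /* domain that owns the page */
};

/* Returns true if the operation may proceed without breaking the
 * guarantee that the mapping can always be revoked. */
static bool model_op_allowed(const struct model_page *pg, enum page_op op,
                             bool caller_privileged_over_owner)
{
    assert(pg != NULL);
    if (!pg->revocable_mapped)
        return true; /* ordinary page: this rule does not apply */

    switch (op) {
    case OP_GRANT:
    case OP_MAP_OTHER_DOMAIN:
        return false;
    case OP_FOREIGN_MAP:
        return caller_privileged_over_owner;
    }
    return false; /* unknown operation: refuse by default */
}
```

Refusing unknown operations by default matches the "may not be limited to" wording in the notes: anything not explicitly known to be safe is treated as creating an extra reference.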