Mapping non-pinned memory from one Xen domain into another

public inbox for linux-mm@kvack.org

* Mapping non-pinned memory from one Xen domain into another
@ 2026-03-24 14:17 Demi Marie Obenour
  2026-03-24 18:00 ` Teddy Astie
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Demi Marie Obenour @ 2026-03-24 14:17 UTC (permalink / raw)
  To: Xen developer discussion, dri-devel, linux-mm, Jan Beulich,
	Val Packett, Ariadne Conill, Andrew Cooper, Juergen Gross

Here is a proposed design document for supporting mapping GPU VRAM
and/or file-backed memory into other domains.  It's not in the form of
a patch because the leading + characters would just make it harder to
read for no particular gain, and because this is still RFC right now.
Once it is ready to merge, I'll send a proper patch.  Nevertheless,
you can consider this to be

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>

This approach is very different from the "frontend-allocates"
approach used elsewhere in Xen.  It is very much Linux-centric,
rather than Xen-centric.  In fact, MMU notifiers were invented for
KVM, and this approach is exactly the same as the one KVM implements.
However, to the best of my understanding, the design described here is
the only viable one.  Linux MM and GPU drivers require it, and changes
to either to relax this requirement will not be accepted upstream.
---
# Memory lending: Mapping pageable memory, such as GPU VRAM, from one Xen domain into another

## Background

Some Linux kernel subsystems require full control over certain memory
regions.  This includes the ability to handle page faults from any
entity accessing this memory.  Such entities include not only that
kernel's userspace, but also kernels belonging to other guests.

For instance, GPU drivers reserve the right to migrate data between
VRAM and system RAM at any time.  Furthermore, there is a set of
page tables between the "aperture" (mapped as a PCI BAR) and the
actual VRAM.  This means that the GPU driver can make the memory
temporarily inaccessible to the CPU.  This is in fact _required_
when resizable BAR is not supported, as otherwise there is too much
VRAM to expose it all via a single BAR.

Since the backing storage of this memory must be movable, pinning
it is not supported.  However, the existing grant table interface
requires pinned memory.  Therefore, such memory currently cannot be
shared with another guest.  As a result, implementing virtio-GPU blob
objects is not possible.  Since blob objects are a prerequisite for
both Venus and native contexts, supporting Vulkan via virtio-GPU on
Xen is also impossible.

Direct Access to Differentiated Memory (DAX) also relies on non-pinned
memory.  In the (now rare) case of persistent memory, it is because
the filesystem may need to move data blocks around on disk.  In the
case of virtio-pmem and virtio-fs, it is because page faults on write
operations are used to inform filesystems that they need to write the
data back at some point.  Without these page faults, filesystems will
not write back the data and silent data loss will result.
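
As a rough illustration of the write-fault argument, here is a toy Python
model (not kernel code; every class and method name is hypothetical).
Pages start out write-protected, the first write faults, and the fault
handler records the page as dirty so that a later writeback knows to
flush it.  A write that bypasses the fault path, which is what a pinned
mapping into another domain would effectively permit, is never flushed.

```python
class ToyDaxFile:
    """Toy model of write-fault based dirty tracking (hypothetical names)."""

    def __init__(self, npages):
        self.backing = [0] * npages   # contents on the backing store
        self.cache = [0] * npages     # contents visible through the mapping
        self.writable = set()         # pages currently mapped writable
        self.dirty = set()            # pages the filesystem knows it must flush

    def write(self, page, value):
        """A write through the normal mapping; faults on the first write."""
        if page not in self.writable:
            # Write fault: the filesystem learns the page is about to change.
            self.dirty.add(page)
            self.writable.add(page)
        self.cache[page] = value

    def write_bypassing_faults(self, page, value):
        """A write through a mapping that never faults (e.g. a pinned one)."""
        self.cache[page] = value

    def writeback(self):
        """Flush only pages recorded as dirty, then write-protect them again."""
        for page in sorted(self.dirty):
            self.backing[page] = self.cache[page]
        self.writable -= self.dirty
        self.dirty.clear()


f = ToyDaxFile(npages=2)
f.write(0, 11)                    # tracked write
f.write_bypassing_faults(1, 22)   # untracked write
f.writeback()
print(f.backing)                  # page 0 is flushed; page 1 never is
```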

There are other use-cases for this too.  For instance, virtio-GPU
cross-domain Wayland exposes host shared memory buffers to the guest.
These buffers are mmap()'d file descriptors provided by the Wayland
compositor, and as such are not guaranteed to be anonymous memory.
Using grant tables for such mappings would conflict with the design
of existing virtio-GPU implementations, which assume that GPU VRAM
and shared memory can be handled uniformly.

Additionally, this is needed to support paging guest memory out to the
host's disks.  While this is significantly less efficient than using
an in-guest balloon driver, it has the advantage of not requiring
guest cooperation.  Therefore, it can be useful for situations in
which the performance of a guest is irrelevant, but where saving the
guest isn't appropriate.

## Informing drivers that they must stop using memory: MMU notifiers

Kernel drivers, such as xen_privcmd, in the same domain that has
the GPU (the "host") may map GPU memory buffers.  However, they must
register an *MMU notifier*.  This is a callback that Linux core memory
management code ("MM") uses to tell the driver that it must stop
all accesses to the memory.  Once the memory is no longer accessed,
Linux assumes it can do whatever it wants with this memory:

- The GPU driver can move it from VRAM to system RAM or vice versa,
  move it within VRAM or system RAM, or make it temporarily inaccessible
  so that other VRAM can be accessed.
- MM can swap the page out to disk/zram/etc.
- MM can move the page in system RAM to create huge pages.
- MM can write the pages out to their backing files and then free them.
- Anything else in Linux can do whatever it wants with the memory.

Suspending access to memory is not allowed to block indefinitely.
It can sleep, but it must finish in finite time regardless of what
userspace (or other VMs) do.  Otherwise, bad things (which I believe
include deadlocks) may result.  I believe it can fail temporarily,
but permanent failure is not allowed either.  Once the MMU notifier
has succeeded, userspace or other domains **must not be allowed to
access the memory**.  Allowing such access would be an exploitable
use-after-free vulnerability.

Due to these requirements, MMU notifier callbacks must not require
cooperation from other guests.  This means that they are not allowed to
wait for memory that has been granted to another guest to no longer
be mapped by that guest.  Therefore, MMU notifiers and the use of
grant tables are inherently incompatible.
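
The incompatibility can be sketched with a toy Python model.  This is
not Xen or Linux code, and the class and method names are hypothetical.
`invalidate()` stands in for the MMU notifier callback and returns
whether the guest's access has actually stopped by the time it returns.
With grant-style sharing, only the guest can drop its own mapping, so an
uncooperative guest prevents the callback from ever succeeding.  With
lending, the lender removes the mapping itself.

```python
class GrantMapping:
    """Grant-style sharing: only the mapping guest can drop the mapping."""

    def __init__(self):
        self.mapped_by_guest = True

    def guest_unmap(self):
        self.mapped_by_guest = False

    def invalidate(self, guest_cooperates):
        """Return True if guest access has stopped when the callback returns."""
        if guest_cooperates:
            self.guest_unmap()
        # Without cooperation the host has no way to make progress here.
        return not self.mapped_by_guest


class LentMapping:
    """Lending-style sharing: the lender can revoke unilaterally."""

    def __init__(self):
        self.mapped_by_guest = True

    def invalidate(self, guest_cooperates):
        """Return True if guest access has stopped when the callback returns."""
        # The lender removes the mapping itself; the guest is not consulted.
        self.mapped_by_guest = False
        return True


for mapping in (GrantMapping(), LentMapping()):
    print(type(mapping).__name__, mapping.invalidate(guest_cooperates=False))
```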

## Memory lending: A different approach

Instead, xen_privcmd must use a different hypercall to _lend_ memory to
another domain (the "guest").  When MM triggers the guest MMU notifier,
xen_privcmd _tells_ Xen (via hypercall) to revoke the guest's access
to the memory.  This hypercall _must succeed in bounded time_ even
if the guest is malicious.

Since the other guests are not aware this has happened, they will
continue to access the memory.  This will cause p2m faults, which
trap to Xen.  Xen normally kills the guest in this situation, which is
obviously not desired behavior.  Instead, Xen must pause the guest
and inform the host's kernel.  xen_privcmd will have registered a
handler for such events, so it will be informed when this happens.

When xen_privcmd is told that a guest wants to access the revoked
page, it will ask core MM to make the page available.  Once the page
_is_ available, core MM will inform xen_privcmd, which will in turn
provide a page to Xen that will be mapped into the guest's stage 2
translation tables.  This page will generally be different than the
one that was originally lent.
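
The lend / revoke / fault / repopulate cycle described above can be
sketched as a toy Python state machine.  It is only an illustration
under simplifying assumptions: one guest, one page per guest frame, and
a counter standing in for whatever page core MM hands back.  None of
the names correspond to real Xen hypercalls or Linux functions.

```python
class ToyLending:
    """Toy model of the lend / revoke / fault / repopulate cycle."""

    def __init__(self):
        self.p2m = {}               # guest frame number -> host page (stage 2)
        self.paused = set()         # guest frames with a vCPU waiting on them
        self.next_host_page = 100   # hypothetical stand-in for the host allocator

    def lend(self, gfn):
        """Host lends a host page to the guest at guest frame gfn."""
        page = self.next_host_page
        self.next_host_page += 1
        self.p2m[gfn] = page
        return page

    def revoke(self, gfn):
        """MMU notifier path: remove the mapping without asking the guest."""
        self.p2m.pop(gfn, None)

    def guest_access(self, gfn):
        """Return the host page hit, or None if the access faulted."""
        if gfn in self.p2m:
            return self.p2m[gfn]
        self.paused.add(gfn)        # stage 2 fault: pause rather than kill
        return None

    def host_handle_fault(self, gfn):
        """Host makes a page available again and lets the guest resume."""
        page = self.lend(gfn)       # generally a different page than before
        self.paused.discard(gfn)
        return page


m = ToyLending()
first = m.lend(gfn=7)
m.revoke(gfn=7)
faulted = m.guest_access(7) is None
second = m.host_handle_fault(7)
print(faulted, first != second, m.guest_access(7) == second)
```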

Requesting a new page can fail.  This is usually due to rare errors,
such as a GPU being hot-unplugged or an I/O error while faulting pages
in from disk.  In these cases, the old content of the page is lost.

When this happens, xen_privcmd can do one of two things:

1. It can provide a page that is filled with zeros.
2. It can tell Xen that it is unable to fulfill the request.

Which choice it makes is under userspace control.  If userspace
chooses the second option, Xen injects a fault into the guest.
It is up to the guest to handle the fault correctly.
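
A minimal sketch of this policy choice, in Python, assuming a
hypothetical `fetch_page` callable that either returns the page
contents or raises `OSError`.  The enum, the function names, and the
4096-byte page size are illustrative assumptions, not a proposed
interface.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size, for illustration only


class OnFailure(Enum):
    ZERO_FILL = 1      # option 1: hand the guest a page of zeros
    INJECT_FAULT = 2   # option 2: report failure so a fault is injected


def resolve_fault(fetch_page, policy):
    """Return ("map", data) to map a page, or ("fault", None) to inject a fault."""
    try:
        return ("map", fetch_page())
    except OSError:
        # The old content is lost; userspace policy decides what happens next.
        if policy is OnFailure.ZERO_FILL:
            return ("map", bytes(PAGE_SIZE))
        return ("fault", None)


def failing_fetch():
    """Simulates a hot-unplugged GPU or a disk I/O error."""
    raise OSError("simulated I/O error")


print(resolve_fault(failing_fetch, OnFailure.INJECT_FAULT)[0])
print(resolve_fault(failing_fetch, OnFailure.ZERO_FILL)[0])
```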

## Restrictions on lent memory

Lent memory is still considered to belong to the lending domain.
The borrowing domain can only access it via its p2m.  Hypercalls made
by the borrowing domain act as if the borrowed memory was not present.
This includes, but is not limited to:

- Using pointers to borrowed memory in hypercall arguments.
- Granting borrowed memory to other VMs.
- Any other operation that depends on whether a page is accessible
  by a domain.

Furthermore:

- Borrowed memory isn't mapped into the IOMMU of any PCIe devices
  the guest has attached, because IOTLB faults generally are not
  replayable.

- Foreign mapping hypercalls that reference lent memory will fail.
  Otherwise, the domain making the foreign mapping hypercall could
  continue to access the borrowed memory after the lease had been
  revoked.  This is true even if the domain performing the foreign
  mapping is an all-powerful dom0.  Otherwise, an emulated device
  could access memory whose lease had been revoked.
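
These rules can be sketched as checks in a toy Python model.  The
domain class and the three functions are hypothetical and exist only
to show which operations see borrowed memory as absent; they are not
Xen interfaces.

```python
class ToyDomain:
    """Toy domain with separate sets of owned and borrowed frames."""

    def __init__(self, name):
        self.name = name
        self.owned = set()      # frames this domain owns
        self.borrowed = set()   # frames lent to this domain by another one


def hypercall_can_use(domain, gfn):
    """Hypercalls see only owned memory; borrowed frames look absent."""
    return gfn in domain.owned


def grant(domain, gfn):
    """Granting works only for memory the domain owns."""
    if not hypercall_can_use(domain, gfn):
        raise PermissionError("frame not present for hypercall purposes")
    return ("grant", domain.name, gfn)


def foreign_map(caller, target, gfn):
    """Foreign mappings of a target's borrowed memory fail, even for dom0."""
    if gfn in target.borrowed:
        raise PermissionError("refusing foreign mapping of lent memory")
    if gfn not in target.owned:
        raise LookupError("frame not populated")
    return ("foreign_map", caller.name, target.name, gfn)


dom0, guest = ToyDomain("dom0"), ToyDomain("guest")
guest.owned.add(1)      # ordinary guest RAM
guest.borrowed.add(2)   # memory lent to the guest
print(grant(guest, 1), foreign_map(dom0, guest, 1))
```

In this model, `grant(guest, 2)` and `foreign_map(dom0, guest, 2)` both
raise `PermissionError`, matching the rules listed above.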

This also means that live migration of a domain that has borrowed
memory requires cooperation from the lending domain.  For now, it
will be considered out of scope.  Live migration is typically used
with server workloads, and accelerators for server hardware often
support SR-IOV.

## Where will lent memory appear in a guest's address space?

Typically, lent memory will be an emulated PCI BAR.  It may be emulated
by dom0 or an alternate ioreq server.  However, it is not *required*
to be a PCI BAR.

## Privileges required for memory lending

For obvious reasons, the domain lending the memory must be privileged
over the domain borrowing it.  The lending domain does not inherently
need to be privileged over the whole system.  However, supporting
situations where the providing domain is not dom0 will require
extensions to Xen's permission model, except for the case where the
providing domain only serves a single VM.

Memory lending hypercalls are not subject to the restrictions of
XSA-77.  They may safely be delegated to VMs other than dom0.

## Userspace API

To the extent possible, the memory lending API should be similar
to KVM's uAPI.  Ideally, userspace should be able to abstract over
the differences.  Using the API should not require root privileges
or be equivalent to root on the host.  It should only require a file
descriptor that only allows controlling a single domain.

## Future directions: Creating & running Xen VMs without special privileges

With the exception of a single page used for hypercalls, it is
possible for a Xen domain to *only* have borrowed memory.  Such a
domain can be managed by an entirely unprivileged userspace process,
just like it would manage a KVM VM.  Since the "host" in this scenario
only needs privilege over a domain it itself created, it is possible
(once a subset of XSA-77 restrictions is lifted) for this domain
to not actually be dom0.

Even with XSA-77, the domain could still request dom0 to create and
destroy the domain on its behalf.  Qubes OS already allows unprivileged
guests to cause domain creation and destruction, so this does not
introduce any new Xen attack surface.

This could allow unprivileged processes in a domU to create and manage
sub-domUs, just as if the domU had nested virtualization support and
KVM was used.  However, this should provide significantly better
performance than nested virtualization.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Mapping non-pinned memory from one Xen domain into another
  2026-03-24 14:17 Mapping non-pinned memory from one Xen domain into another Demi Marie Obenour
@ 2026-03-24 18:00 ` Teddy Astie
  2026-03-26 17:18   ` Demi Marie Obenour
  2026-03-29 17:32 ` Why memory lending is needed for GPU acceleration Demi Marie Obenour
  2026-03-30 12:13 ` Mapping non-pinned memory from one Xen domain into another Teddy Astie
  2 siblings, 1 reply; 14+ messages in thread
From: Teddy Astie @ 2026-03-24 18:00 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion, dri-devel, linux-mm,
	Jan Beulich, Val Packett, Ariadne Conill, Andrew Cooper,
	Juergen Gross

Hello,

I assume all this only concerns HVM/PVH DomU; I don't think it is doable
for PV DomU (if that matters).

Le 24/03/2026 à 15:17, Demi Marie Obenour a écrit :
> Here is a proposed design document for supporting mapping GPU VRAM
> and/or file-backed memory into other domains.  It's not in the form of
> a patch because the leading + characters would just make it harder to
> read for no particular gain, and because this is still RFC right now.
> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
> you can consider this to be
>
> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>
> This approach is very different from the "frontend-allocates"
> approach used elsewhere in Xen.  It is very much Linux-centric,
> rather than Xen-centric.  In fact, MMU notifiers were invented for
> KVM, and this approach is exactly the same as the one KVM implements.
> However, to the best of my understanding, the design described here is
> the only viable one.  Linux MM and GPU drivers require it, and changes
> to either to relax this requirement will not be accepted upstream.
> ---
> # Memory lending: Mapping pageable memory, such as GPU VRAM, from one Xen domain into another
>
> ## Background
>
> Some Linux kernel subsystems require full control over certain memory
> regions.  This includes the ability to handle page faults from any
> entity accessing this memory.  Such entities include not only that
> kernel's userspace, but also kernels belonging to other guests.
>
> For instance, GPU drivers reserve the right to migrate data between
> VRAM and system RAM at any time.  Furthermore, there is a set of
> page tables between the "aperture" (mapped as a PCI BAR) and the
> actual VRAM.  This means that the GPU driver can make the memory
> temporarily inaccessible to the CPU.  This is in fact _required_
> when resizable BAR is not supported, as otherwise there is too much
> VRAM to expose it all via a single BAR.
>
> Since the backing storage of this memory must be movable, pinning
> it is not supported.  However, the existing grant table interface
> requires pinned memory.  Therefore, such memory currently cannot be
> shared with another guest.  As a result, implementing virtio-GPU blob
> objects is not possible.  Since blob objects are a prerequisite for
> both Venus and native contexts, supporting Vulkan via virtio-GPU on
> Xen is also impossible.
>

I'm not sure Vulkan fully allows memory to be moved between RAM and VRAM.
Or at least, you would need to lie a bit on the
VK_MEMORY_HEAP_DEVICE_LOCAL_BIT property.

So I assume there is a way to choose between having memory as VRAM or as
RAM somehow, unless it is only a hint?

> Direct Access to Differentiated Memory (DAX) also relies on non-pinned
> memory.  In the (now rare) case of persistent memory, it is because
> the filesystem may need to move data blocks around on disk.  In the
> case of virtio-pmem and virtio-fs, it is because page faults on write
> operations are used to inform filesystems that they need to write the
> data back at some point.  Without these page faults, filesystems will
> not write back the data and silent data loss will result.
>
> There are other use-cases for this too.  For instance, virtio-GPU
> cross-domain Wayland exposes host shared memory buffers to the guest.
> These buffers are mmap()'d file descriptors provided by the Wayland
> compositor, and as such are not guaranteed to be anonymous memory.
> Using grant tables for such mappings would conflict with the design
> of existing virtio-GPU implementations, which assume that GPU VRAM
> and shared memory can be handled uniformly.
>
> Additionally, this is needed to support paging guest memory out to the
> host's disks.  While this is significantly less efficient than using
> an in-guest balloon driver, it has the advantage of not requiring
> guest cooperation.  Therefore, it can be useful for situations in
> which the performance of a guest is irrelevant, but where saving the
> guest isn't appropriate.
>
> ## Informing drivers that they must stop using memory: MMU notifiers
>
> Kernel drivers, such as xen_privcmd, in the same domain that has
> the GPU (the "host") may map GPU memory buffers.  However, they must
> register an *MMU notifier*.  This is a callback that Linux core memory
> management code ("MM") uses to tell the driver that it must stop
> all accesses to the memory.  Once the memory is no longer accessed,
> Linux assumes it can do whatever it wants with this memory:
>
> - The GPU driver can move it from VRAM to system RAM or vice versa,
>   move it within VRAM or system RAM, or make it temporarily inaccessible
>   so that other VRAM can be accessed.
> - MM can swap the page out to disk/zram/etc.
> - MM can move the page in system RAM to create huge pages.
> - MM can write the pages out to their backing files and then free them.
> - Anything else in Linux can do whatever it wants with the memory.
>
> Suspending access to memory is not allowed to block indefinitely.
> It can sleep, but it must finish in finite time regardless of what
> userspace (or other VMs) do.  Otherwise, bad things (which I believe
> includes deadlocks) may result.  I believe it can fail temporarily,
> but permanent failure is also not allowed.  Once the MMU notifier
> has succeeded, userspace or other domains **must not be allowed to
> access the memory**.  This would be an exploitable use-after-free
> vulnerability.
>
> Due to these requirements, MMU notifier callbacks must not require
> cooperation from other guests.  This means that they are not allowed to
> wait for memory that has been granted to another guest to no longer
> be mapped by that guest.  Therefore, MMU notifiers and the use of
> grant tables are inherently incompatible.


>
> ## Memory lending: A different approach
>
> Instead, xen_privcmd must use a different hypercall to _lend_ memory to
> another domain (the "guest").  When MM triggers the guest MMU notifier,
> xen_privcmd _tells_ Xen (via hypercall) to revoke the guest's access
> to the memory.  This hypercall _must succeed in bounded time_ even
> if the guest is malicious.
>
> Since the other guests are not aware this has happened, they will
> continue to access the memory.  This will cause p2m faults, which
> trap to Xen.  Xen normally kills the guest in this situation which is
> obviously not desired behavior.  Instead, Xen must pause the guest
> and inform the host's kernel.  xen_privcmd will have registered a
> handler for such events, so it will be informed when this happens.
>
> When xen_privcmd is told that a guest wants to access the revoked
> page, it will ask core MM to make the page available.  Once the page
> _is_ available, core MM will inform xen_privcmd, which will in turn
> provide a page to Xen that will be mapped into the guest's stage 2
> translation tables.  This page will generally be different than the
> one that was originally lent.
>
> Requesting a new page can fail.  This is usually due to rare errors,
> such as a GPU being hot-unplugged or an I/O error faulting pages
> from disk.  In these cases, the old content of the page is lost.
>
> When this happens, xen_privcmd can do one of two things:
>
> 1. It can provide a page that is filled with zeros.
> 2. It can tell Xen that it is unable to fulfill the request.
>
> Which choice it makes is under userspace control.  If userspace
> chooses the second option, Xen injects a fault into the guest.
> It is up to the guest to handle the fault correctly.
>
Is it some ioreq-like mechanism where:
- A guest accesses a "non-ready" page
- Nothing there -> page fault (e.g. NPF) and the guest vCPU is blocked
- Xen asks Dom0 what to do (event channel, VIRQ, ...)
- Dom0 explicitly maps memory to the guest (or does any other operation)
- Guest resumes execution with the page mapped

Something that looks a bit similar to "memory paging".

> ## Restrictions on lent memory
>
> Lent memory is still considered to belong to the lending domain.
> The borrowing domain can only access it via its p2m.  Hypercalls made
> by the borrowing domain act as if the borrowed memory was not present.
> This includes, but is not limited to:
>
> - Using pointers to borrowed memory in hypercall arguments.
> - Granting borrowed memory to other VMs.
> - Any other operation that depends on whether a page is accessible
>    by a domain.

What about emulated instructions that refer to this memory?

>
> Furthermore:
>
> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>    the guest has attached, because IOTLB faults generally are not
>    replayable.
>

Given that (as written below) borrowed memory is part of some form of
emulated BAR or special region, there is no guarantee that DMA will work
properly anyway (unless P2P DMA support is advertised).

Splitting the IOMMU side from the P2M is not a good idea as it rules out
the "IOMMU HAP PT Share" optimization.

> - Foreign mapping hypercalls that reference lent memory will fail.
>    Otherwise, the domain making the foreign mapping hypercall could
>    continue to access the borrowed memory after the lease had been
>    revoked.  This is true even if the domain performing the foreign
>    mapping is an all-powerful dom0.  Otherwise, an emulated device
>    could access memory whose lease had been revoked.
>
> This also means that live migration of a domain that has borrowed
> memory requires cooperation from the lending domain.  For now, it
> will be considered out of scope.  Live migration is typically used
> with server workloads, and accelerators for server hardware often
> support SR-IOV.
>
> ## Where will lent memory appear in a guest's address space?
>
> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
> by dom0 or an alternate ioreq server.  However, it is not *required*
> to be a PCI BAR.
>

---

While the design could work (despite the implied complexity), I'm not a
big fan of it, or at least, it needs to consider some constraints for
having reasonable performance.
One of the big issues is that a performance-sensitive system (virtualized
GPU) is interlocking with several "hard to optimize" subsystems, like the
P2M or Dom0 having to process a paging event.

Modifying the P2M (especially removing entries) is a fairly expensive
operation as it sometimes requires pausing all the vCPUs each time it's
done.

If it's done at 4k granularity, it would also lack superpage support,
which wouldn't help either (doing things at the 2M+ scale would help,
but I don't know enough about how MMU notifiers do things).

While I agree that grants are not an adequate mechanism for this (for
multiple reasons), I'm not fully convinced by the proposal.
I would prefer a strategy where we map a fixed amount of RAM+VRAM as a
blob, along with some form of cooperative hotplug mechanism to
dynamically provision the amount.


--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Mapping non-pinned memory from one Xen domain into another
  2026-03-24 18:00 ` Teddy Astie
@ 2026-03-26 17:18   ` Demi Marie Obenour
  2026-03-26 18:26     ` Teddy Astie
  0 siblings, 1 reply; 14+ messages in thread
From: Demi Marie Obenour @ 2026-03-26 17:18 UTC (permalink / raw)
  To: Teddy Astie, Xen developer discussion, dri-devel, linux-mm,
	Jan Beulich, Val Packett, Ariadne Conill, Andrew Cooper,
	Juergen Gross


On 3/24/26 14:00, Teddy Astie wrote:
> Hello,
> 
> I assume all this only concerns HVM/PVH DomU; I don't think it is doable
> for PV DomU (if that matters).
> 
> Le 24/03/2026 à 15:17, Demi Marie Obenour a écrit :
>> Here is a proposed design document for supporting mapping GPU VRAM
>> and/or file-backed memory into other domains.  It's not in the form of
>> a patch because the leading + characters would just make it harder to
>> read for no particular gain, and because this is still RFC right now.
>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>> you can consider this to be
>>
>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> This approach is very different from the "frontend-allocates"
>> approach used elsewhere in Xen.  It is very much Linux-centric,
>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>> KVM, and this approach is exactly the same as the one KVM implements.
>> However, to the best of my understanding, the design described here is
>> the only viable one.  Linux MM and GPU drivers require it, and changes
>> to either to relax this requirement will not be accepted upstream.
>> ---
>> # Memory lending: Mapping pageable memory, such as GPU VRAM, from one Xen domain into another
>>
>> ## Background
>>
>> Some Linux kernel subsystems require full control over certain memory
>> regions.  This includes the ability to handle page faults from any
>> entity accessing this memory.  Such entities include not only that
>> kernel's userspace, but also kernels belonging to other guests.
>>
>> For instance, GPU drivers reserve the right to migrate data between
>> VRAM and system RAM at any time.  Furthermore, there is a set of
>> page tables between the "aperture" (mapped as a PCI BAR) and the
>> actual VRAM.  This means that the GPU driver can make the memory
>> temporarily inaccessible to the CPU.  This is in fact _required_
>> when resizable BAR is not supported, as otherwise there is too much
>> VRAM to expose it all via a single BAR.
>>
>> Since the backing storage of this memory must be movable, pinning
>> it is not supported.  However, the existing grant table interface
>> requires pinned memory.  Therefore, such memory currently cannot be
>> shared with another guest.  As a result, implementing virtio-GPU blob
>> objects is not possible.  Since blob objects are a prerequisite for
>> both Venus and native contexts, supporting Vulkan via virtio-GPU on
>> Xen is also impossible.
>>
> 
> I'm not sure Vulkan fully allows memory to be moved between RAM and VRAM.
> Or at least, you would need to lie a bit on the
> VK_MEMORY_HEAP_DEVICE_LOCAL_BIT property.
>
> So I assume there is a way to choose between having memory as VRAM or as
> RAM somehow, unless it is only a hint?

On Linux, it is indeed only a hint.  If VRAM is exhausted, Linux
will move buffers to system RAM to free up space.  This significantly
benefits desktop workloads, which have a lot of GPU buffers that are
not used frequently.  Games are an exception, but I suspect they use
their buffers frequently enough that Linux doesn't decide to page
them out.

>> Direct Access to Differentiated Memory (DAX) also relies on non-pinned
>> memory.  In the (now rare) case of persistent memory, it is because
>> the filesystem may need to move data blocks around on disk.  In the
>> case of virtio-pmem and virtio-fs, it is because page faults on write
>> operations are used to inform filesystems that they need to write the
>> data back at some point.  Without these page faults, filesystems will
>> not write back the data and silent data loss will result.
>>
>> There are other use-cases for this too.  For instance, virtio-GPU
>> cross-domain Wayland exposes host shared memory buffers to the guest.
>> These buffers are mmap()'d file descriptors provided by the Wayland
>> compositor, and as such are not guaranteed to be anonymous memory.
>> Using grant tables for such mappings would conflict with the design
>> of existing virtio-GPU implementations, which assume that GPU VRAM
>> and shared memory can be handled uniformly.
>>
>> Additionally, this is needed to support paging guest memory out to the
>> host's disks.  While this is significantly less efficient than using
>> an in-guest balloon driver, it has the advantage of not requiring
>> guest cooperation.  Therefore, it can be useful for situations in
>> which the performance of a guest is irrelevant, but where saving the
>> guest isn't appropriate.
>>
>> ## Informing drivers that they must stop using memory: MMU notifiers
>>
>> Kernel drivers, such as xen_privcmd, in the same domain that has
>> the GPU (the "host") may map GPU memory buffers.  However, they must
>> register an *MMU notifier*.  This is a callback that Linux core memory
>> management code ("MM") uses to tell the driver that it must stop
>> all accesses to the memory.  Once the memory is no longer accessed,
>> Linux assumes it can do whatever it wants with this memory:
>>
>> - The GPU driver can move it from VRAM to system RAM or vice versa,
>>   move it within VRAM or system RAM, or make it temporarily inaccessible
>>   so that other VRAM can be accessed.
>> - MM can swap the page out to disk/zram/etc.
>> - MM can move the page in system RAM to create huge pages.
>> - MM can write the pages out to their backing files and then free them.
>> - Anything else in Linux can do whatever it wants with the memory.
>>
>> Suspending access to memory is not allowed to block indefinitely.
>> It can sleep, but it must finish in finite time regardless of what
>> userspace (or other VMs) do.  Otherwise, bad things (which I believe
>> includes deadlocks) may result.  I believe it can fail temporarily,
>> but permanent failure is also not allowed.  Once the MMU notifier
>> has succeeded, userspace or other domains **must not be allowed to
>> access the memory**.  This would be an exploitable use-after-free
>> vulnerability.
>>
>> Due to these requirements, MMU notifier callbacks must not require
>> cooperation from other guests.  This means that they are not allowed to
>> wait for memory that has been granted to another guest to no longer
>> be mapped by that guest.  Therefore, MMU notifiers and the use of
>> grant tables are inherently incompatible.
> 
> 
>>
>> ## Memory lending: A different approach
>>
>> Instead, xen_privcmd must use a different hypercall to _lend_ memory to
>> another domain (the "guest").  When MM triggers the guest MMU notifier,
>> xen_privcmd _tells_ Xen (via hypercall) to revoke the guest's access
>> to the memory.  This hypercall _must succeed in bounded time_ even
>> if the guest is malicious.
>>
>> Since the other guests are not aware this has happened, they will
>> continue to access the memory.  This will cause p2m faults, which
>> trap to Xen.  Xen normally kills the guest in this situation which is
>> obviously not desired behavior.  Instead, Xen must pause the guest
>> and inform the host's kernel.  xen_privcmd will have registered a
>> handler for such events, so it will be informed when this happens.
>>
>> When xen_privcmd is told that a guest wants to access the revoked
>> page, it will ask core MM to make the page available.  Once the page
>> _is_ available, core MM will inform xen_privcmd, which will in turn
>> provide a page to Xen that will be mapped into the guest's stage 2
>> translation tables.  This page will generally be different than the
>> one that was originally lent.
>>
>> Requesting a new page can fail.  This is usually due to rare errors,
>> such as a GPU being hot-unplugged or an I/O error faulting pages
>> from disk.  In these cases, the old content of the page is lost.
>>
>> When this happens, xen_privcmd can do one of two things:
>>
>> 1. It can provide a page that is filled with zeros.
>> 2. It can tell Xen that it is unable to fulfill the request.
>>
>> Which choice it makes is under userspace control.  If userspace
>> chooses the second option, Xen injects a fault into the guest.
>> It is up to the guest to handle the fault correctly.
>>
> Is it some ioreq-like mechanism where:
> - A guest accesses a "non-ready" page
> - Nothing there -> page fault (e.g. NPF) and the guest vCPU is blocked
> - Xen asks Dom0 what to do (event channel, VIRQ, ...)
> - Dom0 explicitly maps memory to the guest (or does any other operation)
> - Guest resumes execution with the page mapped

This is exactly it!

> Something that looks a bit similar to "memory paging".

It *is* memory paging :).  I named it differently for two reasons:

1. I didn't want this to be confused with the existing memory paging
   mechanism in Xen.

2. I wanted to emphasize that lent memory is still "owned" by the
   lending domain.
   
>> ## Restrictions on lent memory
>>
>> Lent memory is still considered to belong to the lending domain.
>> The borrowing domain can only access it via its p2m.  Hypercalls made
>> by the borrowing domain act as if the borrowed memory was not present.
>> This includes, but is not limited to:
>>
>> - Using pointers to borrowed memory in hypercall arguments.
>> - Granting borrowed memory to other VMs.
>> - Any other operation that depends on whether a page is accessible
>>    by a domain.
> 
> What about emulated instructions that refers to this memory ?

This would be allowed if (and only if) it can trigger paging as you
wrote above.
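
The restrictions can be read as a small access policy.  The sketch
below is an illustration under assumptions, not a proposed
implementation: `PageKind`, `Accessor` and `access_allowed` are
made-up names, and owned memory is simplified to "always allowed",
which ignores the usual privilege checks.

```python
from enum import Enum, auto


class PageKind(Enum):
    OWNED = auto()     # ordinary memory belonging to the domain
    BORROWED = auto()  # lent by another domain, reachable only via the p2m


class Accessor(Enum):
    GUEST_CPU = auto()      # normal load/store through stage-2 translation
    HYPERCALL_ARG = auto()  # pointer passed in a hypercall argument
    GRANT = auto()          # granting the page to another VM
    FOREIGN_MAP = auto()    # another domain (even dom0) mapping the page
    EMULATION = auto()      # instruction emulated on the guest's behalf


def access_allowed(kind: PageKind, accessor: Accessor,
                   emulation_can_page: bool = False) -> bool:
    """Return True if this kind of access may touch this kind of page."""
    if kind is PageKind.OWNED:
        return True  # simplification: privilege checks are not modelled
    if accessor is Accessor.GUEST_CPU:
        return True  # the only unconditional path to borrowed memory
    if accessor is Accessor.EMULATION:
        # Allowed if and only if the emulator can trigger the paging path.
        return emulation_can_page
    # Hypercall arguments, grants and foreign maps see the page as absent.
    return False
```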

>> Furthermore:
>>
>> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>>    the guest has attached, because IOTLB faults generally are not
>>    replayable.
>>
> 
> Given that (as written bellow) Borrowed memory is a part of some form of 
> emulated BAR or special region, there is no guarantee that DMA will work 
> properly anyway (unless P2P DMA support is advertised).
> 
> Splitting the IOMMU side from the P2M is not a good idea as it rules out 
> the "IOMMU HAP PT Share" optimization.

If the pages are mapped in the IOMMU, paging them out requires an
IOTLB invalidation.  My understanding is that these are far too slow.

How important is sharing the HAP and IOMMU page tables?

>> - Foreign mapping hypercalls that reference lent memory will fail.
>>    Otherwise, the domain making the foreign mapping hypercall could
>>    continue to access the borrowed memory after the lease had been
>>    revoked.  This is true even if the domain performing the foreign
>>    mapping is an all-powerful dom0.  Otherwise, an emulated device
>>    could access memory whose lease had been revoked.
>>
>> This also means that live migration of a domain that has borrowed
>> memory requires cooperation from the lending domain.  For now, it
>> will be considered out of scope.  Live migration is typically used
>> with server workloads, and accelerators for server hardware often
>> support SR-IOV.
>>
>> ## Where will lent memory appear in a guest's address space?
>>
>> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
>> by dom0 or an alternate ioreq server.  However, it is not *required*
>> to be a PCI BAR.
>>
> 
> ---
> 
> While the design could work (albeit the implied complexity), I'm not a 
> big fan of it, or at least, it needs to consider some constraints for 
> having reasonable performance.
> One of the big issue is that a performance-sensitive system (virtualized 
> GPU) is interlocking with several "hard to optimize" subsystem like P2M 
> or Dom0 having to process a paging event.
> 
> Modifying the P2M (especially removing entries) is a fairly expensive 
> operation as it sometimes requires pausing all the vCPUs each time it's 
> done.

Not every GPU supports recoverable page faults.  Even when they
are supported, they are extremely expensive.  Each of them involves
a round-trip from the GPU to the CPU and back, which means that a
potentially very large number of GPU cores are blocked until the
CPU can respond.  Therefore, GPU driver developers avoid relying on
GPU page faults whenever possible.  Instead, data is moved in large
chunks using a dedicated DMA engine in the GPU.

As a result, I'm not too concerned with the cost of P2M manipulation.
Anything that requires making a GPU buffer temporarily inaccessible
is already an expensive process, and driver developers have strong
incentives to keep the time the buffer is unmapped as short as
possible.

If performance turns out to be a problem, something like KVM's
asynchronous page faults might be a better solution.

> If it's done at 4k granularity, it would also lack superpage support, 
> which wouldn't help either. (doing things at the 2M+ scale would help, 
> but I don't know enough how MMU notifier does things.
> 
> While I agree that grants is not a adequate mechanism for this (for 
> multiples reasons), I'm not fully convinced of the proposal.
> I would prefer a strategy where we map a fixed amount of RAM+VRAM as a 
> blob, along with some form of cooperative hotplug mechanism to 
> dynamically provision the amount.

I asked the GPU driver developers about pinning VRAM like this a couple
years ago or so.  The response I got was that it isn't supported.
I suspect that anyone needing VRAM pinning for graphics workloads is
using non-upstreamable hacks, most likely specific to a single driver.

More generally, the entire graphics stack receives essentially no
testing under Xen.  There have been bugs that have affected Qubes OS
users for months or more, and they went unfixed because they couldn't
be reproduced outside of Xen.  To the upstream graphics developers,
Xen might as well not exist.  This means that any solution that
requires changing the graphics stack is not a practical option,
and I do not expect this to change in the foreseeable future.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping non-pinned memory from one Xen domain into another
  2026-03-26 17:18   ` Demi Marie Obenour
@ 2026-03-26 18:26     ` Teddy Astie
  2026-03-27 17:18       ` Demi Marie Obenour
  0 siblings, 1 reply; 14+ messages in thread
From: Teddy Astie @ 2026-03-26 18:26 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion, dri-devel, linux-mm,
	Jan Beulich, Val Packett, Ariadne Conill, Andrew Cooper,
	Juergen Gross

Le 26/03/2026 à 18:18, Demi Marie Obenour a écrit :
> On 3/24/26 14:00, Teddy Astie wrote:
>>> ## Restrictions on lent memory
>>>
>>> Lent memory is still considered to belong to the lending domain.
>>> The borrowing domain can only access it via its p2m.  Hypercalls made
>>> by the borrowing domain act as if the borrowed memory was not present.
>>> This includes, but is not limited to:
>>>
>>> - Using pointers to borrowed memory in hypercall arguments.
>>> - Granting borrowed memory to other VMs.
>>> - Any other operation that depends on whether a page is accessible
>>>     by a domain.
>>
>> What about emulated instructions that refers to this memory ?
>
> This would be allowed if (and only if) it can trigger paging as you
> wrote above.
>
>>> Furthermore:
>>>
>>> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>>>     the guest has attached, because IOTLB faults generally are not
>>>     replayable.
>>>
>>
>> Given that (as written bellow) Borrowed memory is a part of some form of
>> emulated BAR or special region, there is no guarantee that DMA will work
>> properly anyway (unless P2P DMA support is advertised).
>>
>> Splitting the IOMMU side from the P2M is not a good idea as it rules out
>> the "IOMMU HAP PT Share" optimization.
>
> If the pages are mapped in the IOMMU, paging them out requires an
> IOTLB invalidation.  My understanding is that these are far too slow.
>

Yes (aside from specific cases such as a paravirtualized IOMMU), but
only if you have a device in the guest.

The problem is that this would force us to modify the ABI to have
"non-DMA-able" memory in the guest, which doesn't exist yet aside from
specific cases like grants in PV.

> How important is sharing the HAP and IOMMU page tables?
>
>>> - Foreign mapping hypercalls that reference lent memory will fail.
>>>     Otherwise, the domain making the foreign mapping hypercall could
>>>     continue to access the borrowed memory after the lease had been
>>>     revoked.  This is true even if the domain performing the foreign
>>>     mapping is an all-powerful dom0.  Otherwise, an emulated device
>>>     could access memory whose lease had been revoked.
>>>
>>> This also means that live migration of a domain that has borrowed
>>> memory requires cooperation from the lending domain.  For now, it
>>> will be considered out of scope.  Live migration is typically used
>>> with server workloads, and accelerators for server hardware often
>>> support SR-IOV.
>>>
>>> ## Where will lent memory appear in a guest's address space?
>>>
>>> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
>>> by dom0 or an alternate ioreq server.  However, it is not *required*
>>> to be a PCI BAR.
>>>
>>
>> ---
>>
>> While the design could work (albeit the implied complexity), I'm not a
>> big fan of it, or at least, it needs to consider some constraints for
>> having reasonable performance.
>> One of the big issue is that a performance-sensitive system (virtualized
>> GPU) is interlocking with several "hard to optimize" subsystem like P2M
>> or Dom0 having to process a paging event.
>>
>> Modifying the P2M (especially removing entries) is a fairly expensive
>> operation as it sometimes requires pausing all the vCPUs each time it's
>> done.
>
> Not every GPU supports recoverable page faults.  Even when they
> are supported, they are extremely expensive.  Each of them involves
> a round-trip from the GPU to the CPU and back, which means that a
> potentially very large number of GPU cores are blocked until the
> CPU can respond.  Therefore, GPU driver developers avoid relying on
> GPU page faults whenever possible.  Instead, data is moved in large
> chunks using a dedicated DMA engine in the GPU.
> As a result, I'm not too concerned with the cost of P2M manipulation.
> Anything that requires making a GPU buffer temporarily inaccessible
> is already an expensive process, and driver developers have strong
> incentives to keep the time the buffer is unmapped as short as
> possible.
> If performance turns out to be a problem, something like KVM's
> asynchronous page faults might be a better solution.
>

Asynchronous page faults look like an interesting approach, and
potentially an easier one to implement.

IIUC, the idea is to make the pages disappear on the guest's behalf, and
the guest would have to deal with the eventual page fault. Currently in
Xen, an unhandled #NPF is fatal, but that could be relaxed for
specific regions and transformed into a #PF or another exception for the
guest to handle.

We actually have a similar need for SEV-ES MMIO handling, as we need to
distinguish "MMIO-related NPFs" (to paravirtualize through the GHCB) from
other NPFs; this needs to be configured in advance in the page tables (so
that the CPU chooses between #VC and VMEXIT#NPF).

It would also need some form of paravirtualization, coming from virtio
or a new Xen PV driver, for the guest to be made aware of this mechanism.
I also assume that the guest properly handles that kind of event.

>> If it's done at 4k granularity, it would also lack superpage support,
>> which wouldn't help either. (doing things at the 2M+ scale would help,
>> but I don't know enough how MMU notifier does things.
>>
>> While I agree that grants is not a adequate mechanism for this (for
>> multiples reasons), I'm not fully convinced of the proposal.
>> I would prefer a strategy where we map a fixed amount of RAM+VRAM as a
>> blob, along with some form of cooperative hotplug mechanism to
>> dynamically provision the amount.
>
> I asked the GPU driver developers about pinning VRAM like this a couple
> years ago or so.  The response I got was that it isn't supported.
> I suspect that anyone needing VRAM pinning for graphics workloads is
> using non-upstreamable hacks, most likely specific to a single driver.
>
> More generally, the entire graphics stack receives essentially no
> testing under Xen.  There have been bugs that have affected Qubes OS
> users for months or more, and they went unfixed because they couldn't
> be reproduced outside of Xen.  To the upstream graphics developers,
> Xen might as well not exist.  This means that any solution that
> requires changing the graphics stack is not a practical option,
> and I do not expect this to change in the foreseeable future.


--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





* Re: Mapping non-pinned memory from one Xen domain into another
  2026-03-26 18:26     ` Teddy Astie
@ 2026-03-27 17:18       ` Demi Marie Obenour
  0 siblings, 0 replies; 14+ messages in thread
From: Demi Marie Obenour @ 2026-03-27 17:18 UTC (permalink / raw)
  To: Teddy Astie, Xen developer discussion, dri-devel, linux-mm,
	Jan Beulich, Val Packett, Ariadne Conill, Andrew Cooper,
	Juergen Gross, Marek Marczykowski-Górecki



On 3/26/26 14:26, Teddy Astie wrote:
> Le 26/03/2026 à 18:18, Demi Marie Obenour a écrit :
>> On 3/24/26 14:00, Teddy Astie wrote:
>>>> ## Restrictions on lent memory
>>>>
>>>> Lent memory is still considered to belong to the lending domain.
>>>> The borrowing domain can only access it via its p2m.  Hypercalls made
>>>> by the borrowing domain act as if the borrowed memory was not present.
>>>> This includes, but is not limited to:
>>>>
>>>> - Using pointers to borrowed memory in hypercall arguments.
>>>> - Granting borrowed memory to other VMs.
>>>> - Any other operation that depends on whether a page is accessible
>>>>     by a domain.
>>>
>>> What about emulated instructions that refers to this memory ?
>>
>> This would be allowed if (and only if) it can trigger paging as you
>> wrote above.
>>
>>>> Furthermore:
>>>>
>>>> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>>>>     the guest has attached, because IOTLB faults generally are not
>>>>     replayable.
>>>>
>>>
>>> Given that (as written bellow) Borrowed memory is a part of some form of
>>> emulated BAR or special region, there is no guarantee that DMA will work
>>> properly anyway (unless P2P DMA support is advertised).
>>>
>>> Splitting the IOMMU side from the P2M is not a good idea as it rules out
>>> the "IOMMU HAP PT Share" optimization.
>>
>> If the pages are mapped in the IOMMU, paging them out requires an
>> IOTLB invalidation.  My understanding is that these are far too slow.
>>
> 
> yes (aside specific cases like with paravirtualized IOMMU), but only if 
> you have a device in the guest.
> 
> The problem is that that would force us to modify the ABI to have 
> "non-DMA-able" memory in the guest, which doesn't exist yet aside 
> specific cases like grants in PV.

This would make the mechanism *de facto* incompatible with PCI
passthrough.  That is unfortunate but not a dealbreaker for most
applications.  It's quite annoying, though, because of dual-GPU setups
where one GPU is paravirtualized and the other is passed through.

I don't think it necessarily needs any new guest ABI changes.
As you pointed out, guests are not allowed to assume that P2PDMA
works, so if the guest tries to DMA to these pages it's a guest bug.
This means that whether the pages can be DMA'd to or not is not a
guest-facing ABI.

That said, this should not block getting this feature implemented.

>> How important is sharing the HAP and IOMMU page tables?
>>
>>>> - Foreign mapping hypercalls that reference lent memory will fail.
>>>>     Otherwise, the domain making the foreign mapping hypercall could
>>>>     continue to access the borrowed memory after the lease had been
>>>>     revoked.  This is true even if the domain performing the foreign
>>>>     mapping is an all-powerful dom0.  Otherwise, an emulated device
>>>>     could access memory whose lease had been revoked.
>>>>
>>>> This also means that live migration of a domain that has borrowed
>>>> memory requires cooperation from the lending domain.  For now, it
>>>> will be considered out of scope.  Live migration is typically used
>>>> with server workloads, and accelerators for server hardware often
>>>> support SR-IOV.
>>>>
>>>> ## Where will lent memory appear in a guest's address space?
>>>>
>>>> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
>>>> by dom0 or an alternate ioreq server.  However, it is not *required*
>>>> to be a PCI BAR.
>>>>
>>>
>>> ---
>>>
>>> While the design could work (albeit the implied complexity), I'm not a
>>> big fan of it, or at least, it needs to consider some constraints for
>>> having reasonable performance.
>>> One of the big issue is that a performance-sensitive system (virtualized
>>> GPU) is interlocking with several "hard to optimize" subsystem like P2M
>>> or Dom0 having to process a paging event.
>>>
>>> Modifying the P2M (especially removing entries) is a fairly expensive
>>> operation as it sometimes requires pausing all the vCPUs each time it's
>>> done.
>>
>> Not every GPU supports recoverable page faults.  Even when they
>> are supported, they are extremely expensive.  Each of them involves
>> a round-trip from the GPU to the CPU and back, which means that a
>> potentially very large number of GPU cores are blocked until the
>> CPU can respond.  Therefore, GPU driver developers avoid relying on
>> GPU page faults whenever possible.  Instead, data is moved in large
>> chunks using a dedicated DMA engine in the GPU.
>> As a result, I'm not too concerned with the cost of P2M manipulation.
>> Anything that requires making a GPU buffer temporarily inaccessible
>> is already an expensive process, and driver developers have strong
>> incentives to keep the time the buffer is unmapped as short as
>> possible.
>> If performance turns out to be a problem, something like KVM's
>> asynchronous page faults might be a better solution.
>>
> 
> Asynchronous page fault looks like a interesting and potentially easier 
> to implement.
> 
> IIUC, the idea is to make the pages disappears on the guest behalf, and 
> the guest would have to deal with the eventual page fault. Currently in 
> Xen, a unhandled #NPF is fatal, but that could be tuned down for 
> specific regions and transformed into a #PF or another exception for the 
> guest to handle.

Yup!

> We have actually a similar need for SEV-ES MMIO handling, as we need to 
> distinguish "MMIO-related NPF" (to paravirtualize through GHCB) to the 
> other NPF; which needs to be configured in advance in page-tables (so 
> that the CPU choose between #VC and VMEXIT#NPF).
> 
> It would also need some form of para-virtualization coming from virtio 
> or a new Xen PV driver for the guest to be made aware of this mechanism.
> I also assume that the guest handles properly that kind of event.

On KVM, asynchronous page faults are purely an optimization.  I have
a few concerns with relying entirely on them:

1. Can guest userspace use this to crash the guest kernel?  What
   happens if the guest kernel takes a fault in copy_{to,from}_user()?

2. Can this be made to work with Windows guests?

3. Could this run into a livelock problem?  Xen could tell the guest
   that the page is ready, but by the time the guest gets around to
   scheduling the userspace program, the page has been paged out again.
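
The third concern can be illustrated with a deliberately crude timing
model.  This is a sketch under assumptions, not a description of how
Xen or KVM behaves: `guest_makes_progress`, `evict_period` and
`wake_latency` are invented for the example, and time is counted in
abstract ticks.

```python
def guest_makes_progress(evict_period: int, wake_latency: int,
                         max_ticks: int = 1000) -> bool:
    """Return True if the guest ever touches the page while it is mapped.

    evict_period: ticks the page stays mapped after each "page ready" event.
    wake_latency: ticks between "page ready" and the faulting task running.
    """
    tick = 0
    while tick < max_ticks:
        ready_at = tick                    # host maps the page, signals ready
        evicted_at = ready_at + evict_period
        runs_at = ready_at + wake_latency  # guest finally schedules the task
        if runs_at < evicted_at:
            return True                    # access lands while still mapped
        tick = runs_at + 1                 # faults again, requests page again
    return False
```

In this model the task only makes progress when `wake_latency` is
smaller than `evict_period`.  A synchronous fault, where the vCPU is
paused and resumes directly on the mapped page, roughly corresponds to
a `wake_latency` of 0, so any nonzero `evict_period` is enough.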

>>> If it's done at 4k granularity, it would also lack superpage support,
>>> which wouldn't help either. (doing things at the 2M+ scale would help,
>>> but I don't know enough how MMU notifier does things.

As an aside, graphics very much needs huge pages.  On AMD, using 4K
pages means a 30% performance hit.
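
Granularity also determines how many stage-2 entries have to be
removed and re-established when a buffer is revoked.  A quick worked
example, where the 256 MiB buffer size is an arbitrary assumption and
`p2m_entries` is a hypothetical helper:

```python
KiB = 1024
MiB = 1024 * KiB


def p2m_entries(buffer_bytes: int, page_bytes: int) -> int:
    """Number of mappings needed to cover a buffer, rounding up."""
    return -(-buffer_bytes // page_bytes)


buf = 256 * MiB
small = p2m_entries(buf, 4 * KiB)  # 65536 entries with 4K pages
large = p2m_entries(buf, 2 * MiB)  # 128 entries with 2M pages
ratio = small // large             # 512 times fewer entries with 2M pages
```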

>>> While I agree that grants is not a adequate mechanism for this (for
>>> multiples reasons), I'm not fully convinced of the proposal.
>>> I would prefer a strategy where we map a fixed amount of RAM+VRAM as a
>>> blob, along with some form of cooperative hotplug mechanism to
>>> dynamically provision the amount.
>>
>> I asked the GPU driver developers about pinning VRAM like this a couple
>> years ago or so.  The response I got was that it isn't supported.
>> I suspect that anyone needing VRAM pinning for graphics workloads is
>> using non-upstreamable hacks, most likely specific to a single driver.
>>
>> More generally, the entire graphics stack receives essentially no
>> testing under Xen.  There have been bugs that have affected Qubes OS
>> users for months or more, and they went unfixed because they couldn't
>> be reproduced outside of Xen.  To the upstream graphics developers,
>> Xen might as well not exist.  This means that any solution that
>> requires changing the graphics stack is not a practical option,
>> and I do not expect this to change in the foreseeable future.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Why memory lending is needed for GPU acceleration
  2026-03-24 14:17 Mapping non-pinned memory from one Xen domain into another Demi Marie Obenour
  2026-03-24 18:00 ` Teddy Astie
@ 2026-03-29 17:32 ` Demi Marie Obenour
  2026-03-30 10:15   ` Teddy Astie
  2026-03-30 20:07   ` Val Packett
  2026-03-30 12:13 ` Mapping non-pinned memory from one Xen domain into another Teddy Astie
  2 siblings, 2 replies; 14+ messages in thread
From: Demi Marie Obenour @ 2026-03-29 17:32 UTC (permalink / raw)
  To: Xen developer discussion, dri-devel, linux-mm, Jan Beulich,
	Val Packett, Ariadne Conill, Andrew Cooper, Juergen Gross,
	Teddy Astie


On 3/24/26 10:17, Demi Marie Obenour wrote:
> Here is a proposed design document for supporting mapping GPU VRAM
> and/or file-backed memory into other domains.  It's not in the form of
> a patch because the leading + characters would just make it harder to
> read for no particular gain, and because this is still RFC right now.
> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
> you can consider this to be
> 
> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
> 
> This approach is very different from the "frontend-allocates"
> approach used elsewhere in Xen.  It is very much Linux-centric,
> rather than Xen-centric.  In fact, MMU notifiers were invented for
> KVM, and this approach is exactly the same as the one KVM implements.
> However, to the best of my understanding, the design described here is
> the only viable one.  Linux MM and GPU drivers require it, and changes
> to either to relax this requirement will not be accepted upstream.

Teddy Astie (CCd) proposed a couple of alternatives on Matrix:

1. Create dma-bufs for guest pages and import them into the host.

   This is a win not only for Xen, but also for KVM.  Right now, shared
   (CPU) memory buffers must be copied from the guest to the host,
   which is pointless.  So fixing that is a good thing!  That said,
   I'm still concerned about triggering GPU driver code-paths that
   are not tested on bare metal.

2. Use PASID and 2-stage translation so that the GPU can operate in
   guest physical memory.

   This is also a win.  AMD XDNA absolutely requires PASID support,
   and apparently AMD GPUs can also use PASID.  So being able to use
   PASID is certainly helpful.

However, I don't think either approach is sufficient for two reasons.

First, discrete GPUs have dedicated VRAM, which Xen knows nothing about.
Only dom0's GPU drivers can manage VRAM, and they will insist on being
able to migrate it between the CPU and the GPU.  Furthermore, VRAM
can only be allocated using GPU driver ioctls, which will allocate
it from dom0-owned memory.

Second, certain Wayland protocols, such as screencapture, require programs
to be able to import dmabufs.  Both of the above solutions would
require that the pages be pinned.  I don't think this is an option,
as IIUC pin_user_pages() fails on mappings of these dmabufs.  It's why
direct I/O to dmabufs doesn't work.

To the best of my knowledge, these problems mean that lending memory
is the only way to get robust GPU acceleration for both graphics and
compute workloads under Xen.  Simpler approaches might work for pure
compute workloads, for iGPUs, or for drivers that have Xen-specific
changes.  None of them, however, support graphics workloads on dGPUs
while using the GPU driver the same way bare metal workloads do.

Linux's graphics stack is massive, and trying to adapt it to work with
Xen isn't going to be sustainable in the long term.  Adapting Xen to
fit the graphics stack is probably more work up front, but it has the
advantage of working with all GPU drivers, including ones that have not
been written yet.  It also means that the testing done on bare metal is
still applicable, and that bugs found when using this driver can either
be reproduced on bare metal or can be fixed without driver changes.

Finally, I'm not actually attached to memory lending at all.  It's a
lot of complexity, and it's not at all similar to how the rest of
Xen works.  If someone else can come up with a better solution that
doesn't require GPU driver changes, I'd be all for it.  Unfortunately,
I suspect none exists.  One can make almost anything work if one is
willing to patch the drivers, but I am virtually certain that this
will not be long-term sustainable.

If Xen had its own GPU drivers, the situation would be totally
different.  However, Xen must rely on Linux's GPU drivers, and that
means it must play by their rules.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Why memory lending is needed for GPU acceleration
  2026-03-29 17:32 ` Why memory lending is needed for GPU acceleration Demi Marie Obenour
@ 2026-03-30 10:15   ` Teddy Astie
  2026-03-30 10:25     ` Jan Beulich
  2026-03-30 12:24     ` Demi Marie Obenour
  2026-03-30 20:07   ` Val Packett
  1 sibling, 2 replies; 14+ messages in thread
From: Teddy Astie @ 2026-03-30 10:15 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion, dri-devel, linux-mm,
	Jan Beulich, Val Packett, Ariadne Conill, Andrew Cooper,
	Juergen Gross

Le 29/03/2026 à 19:32, Demi Marie Obenour a écrit :
> On 3/24/26 10:17, Demi Marie Obenour wrote:
>> Here is a proposed design document for supporting mapping GPU VRAM
>> and/or file-backed memory into other domains.  It's not in the form of
>> a patch because the leading + characters would just make it harder to
>> read for no particular gain, and because this is still RFC right now.
>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>> you can consider this to be
>>
>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> This approach is very different from the "frontend-allocates"
>> approach used elsewhere in Xen.  It is very much Linux-centric,
>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>> KVM, and this approach is exactly the same as the one KVM implements.
>> However, to the best of my understanding, the design described here is
>> the only viable one.  Linux MM and GPU drivers require it, and changes
>> to either to relax this requirement will not be accepted upstream.
>
> Teddy Astie (CCd) proposed a couple of alternatives on Matrix:
>
> 1. Create dma-bufs for guest pages and import them into the host.
>
>     This is a win not only for Xen, but also for KVM.  Right now, shared
>     (CPU) memory buffers must be copied from the guest to the host,
>     which is pointless.  So fixing that is a good thing!  That said,
>     I'm still concerned about triggering GPU driver code-paths that
>     are not tested on bare metal.
>
> 2. Use PASID and 2-stage translation so that the GPU can operate in
>     guest physical memory.
>
>     This is also a win.  AMD XDNA absolutely requires PASID support,
>     and apparently AMD GPUs can also use PASID.  So being able to use
>     PASID is certainly helpful.
>
> However, I don't think either approach is sufficient for two reasons.
>
> First, discrete GPUs have dedicated VRAM, which Xen knows nothing about.
> Only dom0's GPU drivers can manage VRAM, and they will insist on being
> able to migrate it between the CPU and the GPU.  Furthermore, VRAM
> can only be allocated using GPU driver ioctls, which will allocate
> it from dom0-owned memory.
>
> Second, Certain Wayland protocols, such as screencapture, require programs
> to be able to import dmabufs.  Both of the above solutions would
> require that the pages be pinned.  I don't think this is an option,
> as IIUC pin_user_pages() fails on mappings of these dmabufs.  It's why
> direct I/O to dmabufs doesn't work.
>

I suppose it fails because of the RAM/VRAM constraint you mentioned
previously. If the location of the memory stays the same (i.e. guest
memory mapping), pinning should be almost a no-op.

(Though having dma-buf buffers from GPU drivers fail to pin is probably
not a good thing in terms of stability; some things like cameras probably
break as a result. But I'm not an expert on that subject.)

> To the best of my knowledge, these problems mean that lending memory
> is the only way to get robust GPU acceleration for both graphics and
> compute workloads under Xen.  Simpler approaches might work for pure
> compute workloads, for iGPUs, or for drivers that have Xen-specific
> changes.  None of them, however, support graphics workloads on dGPUs
> while using the GPU driver the same way bare metal workloads do.
>
> Linux's graphics stack is massive, and trying to adapt it to work with
> Xen isn't going to be sustainable in the long term.  Adapting Xen to
> fit the graphics stack is probably more work up front, but it has the
> advantage of working with all GPU drivers, including ones that have not
> been written yet.  It also means that the testing done on bare metal is
> still applicable, and that bugs found when using this driver can either
> be reproduced on bare metal or can be fixed without driver changes.

One of my main concerns was whether dma-bufs can be used as
"general purpose" GPU buffers; what I read in driver code suggests it
should be fine, but it's a bit on the edge.

>
> Finally, I'm not actually attached to memory lending at all.  It's a
> lot of complexity, and it's not at all similar to how the rest of
> Xen works.  If someone else can come up with a better solution that
> doesn't require GPU driver changes, I'd be all for it.  Unfortunately,
> I suspect none exists.  One can make almost anything work if one is
> willing to patch the drivers, but I am virtually certain that this
> will not be long-term sustainable.
>

There's also the virtio-gpu side to consider. The blob mechanism appears
to insist that GPU memory come from the host, by allowing buffers that
aren't yet bound to the virtio-gpu BAR (which also complicates the KVM
situation).

You can have GPU memory that exists in virtio-gpu without being
guest-visible; the guest can then map it into its own BAR.

> If Xen had its own GPU drivers, the situation would be totally
> different.  However, Xen must rely on Linux's GPU drivers, and that
> means it must play by their rules.




--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





* Re: Why memory lending is needed for GPU acceleration
  2026-03-30 10:15   ` Teddy Astie
@ 2026-03-30 10:25     ` Jan Beulich
  2026-03-30 12:24     ` Demi Marie Obenour
  1 sibling, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2026-03-30 10:25 UTC (permalink / raw)
  To: Teddy Astie, Demi Marie Obenour
  Cc: Xen developer discussion, dri-devel, linux-mm, Val Packett,
	Ariadne Conill, Andrew Cooper, Juergen Gross

On 30.03.2026 12:15, Teddy Astie wrote:
> Le 29/03/2026 à 19:32, Demi Marie Obenour a écrit :

May I ask that the two of you please properly separate To: vs Cc:?

Thanks, Jan

>> On 3/24/26 10:17, Demi Marie Obenour wrote:
>>> Here is a proposed design document for supporting mapping GPU VRAM
>>> and/or file-backed memory into other domains.  It's not in the form of
>>> a patch because the leading + characters would just make it harder to
>>> read for no particular gain, and because this is still RFC right now.
>>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>>> you can consider this to be
>>>
>>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>>
>>> This approach is very different from the "frontend-allocates"
>>> approach used elsewhere in Xen.  It is very much Linux-centric,
>>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>>> KVM, and this approach is exactly the same as the one KVM implements.
>>> However, to the best of my understanding, the design described here is
>>> the only viable one.  Linux MM and GPU drivers require it, and changes
>>> to either to relax this requirement will not be accepted upstream.
>>
>> Teddy Astie (CCd) proposed a couple of alternatives on Matrix:
>>
>> 1. Create dma-bufs for guest pages and import them into the host.
>>
>>     This is a win not only for Xen, but also for KVM.  Right now, shared
>>     (CPU) memory buffers must be copied from the guest to the host,
>>     which is pointless.  So fixing that is a good thing!  That said,
>>     I'm still concerned about triggering GPU driver code-paths that
>>     are not tested on bare metal.
>>     
>> 2. Use PASID and 2-stage translation so that the GPU can operate in
>>     guest physical memory.
>>     
>>     This is also a win.  AMD XDNA absolutely requires PASID support,
>>     and apparently AMD GPUs can also use PASID.  So being able to use
>>     PASID is certainly helpful.
>>
>> However, I don't think either approach is sufficient for two reasons.
>>
>> First, discrete GPUs have dedicated VRAM, which Xen knows nothing about.
>> Only dom0's GPU drivers can manage VRAM, and they will insist on being
>> able to migrate it between the CPU and the GPU.  Furthermore, VRAM
>> can only be allocated using GPU driver ioctls, which will allocate
>> it from dom0-owned memory.
>>
>> Second, Certain Wayland protocols, such as screencapture, require programs
>> to be able to import dmabufs.  Both of the above solutions would
>> require that the pages be pinned.  I don't think this is an option,
>> as IIUC pin_user_pages() fails on mappings of these dmabufs.  It's why
>> direct I/O to dmabufs doesn't work.
>>
> 
> I suppose it fails because of the RAM/VRAM constraint you said 
> previously. If the location of the memory stays the same (i.e guest 
> memory mapping), pin should be almost "no-op".
> 
> (though, having dma-buf buffers coming from GPU drivers failing to pin 
> is probably not a good thing in term of stability; some stuff like 
> cameras probably break as a result; but I'm not a expert on that subject)
> 
>> To the best of my knowledge, these problems mean that lending memory
>> is the only way to get robust GPU acceleration for both graphics and
>> compute workloads under Xen.  Simpler approaches might work for pure
>> compute workloads, for iGPUs, or for drivers that have Xen-specific
>> changes.  None of them, however, support graphics workloads on dGPUs
>> while using the GPU driver the same way bare metal workloads do.
>>
>> Linux's graphics stack is massive, and trying to adapt it to work with
>> Xen isn't going to be sustainable in the long term.  Adapting Xen to
>> fit the graphics stack is probably more work up front, but it has the
>> advantage of working with all GPU drivers, including ones that have not
>> been written yet.  It also means that the testing done on bare metal is
>> still applicable, and that bugs found when using this driver can either
>> be reproduced on bare metal or can be fixed without driver changes.
> 
> One of my main concerns was about whether dma-buf can be used as 
> "general purpose" GPU buffers; what I read in driver code suggest it 
> should be fine, but it's a bit on the edge.
> 
>>
>> Finally, I'm not actually attached to memory lending at all.  It's a
>> lot of complexity, and it's not at all similar to how the rest of
>> Xen works.  If someone else can come up with a better solution that
>> doesn't require GPU driver changes, I'd be all for it.  Unfortunately,
>> I suspect none exists.  One can make almost anything work if one is
>> willing to patch the drivers, but I am virtually certain that this
>> will not be long-term sustainable.
>>
> 
> There's also the virtio-gpu side to consider. Blob mechanism appears to 
> insist that GPU memory to come from the host by allowing buffers that 
> aren't bound to virtio-gpu BAR yet (that also complexifies the KVM 
> situation).
> 
> You can have GPU memory that exists in virtio-gpu, without being 
> guest-visible, then the guest can map it on its own BAR.
> 
>> If Xen had its own GPU drivers, the situation would be totally
>> different.  However, Xen must rely on Linux's GPU drivers, and that
>> means it must play by their rules.
> 
> 
> 
> 
> --
> Teddy Astie | Vates XCP-ng Developer
> 
> XCP-ng & Xen Orchestra - Vates solutions
> 
> web: https://vates.tech
> 
> 




* Re: Why memory lending is needed for GPU acceleration
  2026-03-30 10:15   ` Teddy Astie
  2026-03-30 10:25     ` Jan Beulich
@ 2026-03-30 12:24     ` Demi Marie Obenour
  1 sibling, 0 replies; 14+ messages in thread
From: Demi Marie Obenour @ 2026-03-30 12:24 UTC (permalink / raw)
  To: Teddy Astie
  Cc: Xen developer discussion, dri-devel, Jan Beulich, Juergen Gross,
	linux-mm, Andrew Cooper, Ariadne Conill, Val Packett,
	Marek Marczykowski-Górecki, Alyssa Ross


[-- Attachment #1.1.1: Type: text/plain, Size: 6535 bytes --]

On 3/30/26 06:15, Teddy Astie wrote:
> On 29/03/2026 at 19:32, Demi Marie Obenour wrote:
>> On 3/24/26 10:17, Demi Marie Obenour wrote:
>>> Here is a proposed design document for supporting mapping GPU VRAM
>>> and/or file-backed memory into other domains.  It's not in the form of
>>> a patch because the leading + characters would just make it harder to
>>> read for no particular gain, and because this is still RFC right now.
>>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>>> you can consider this to be
>>>
>>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>>
>>> This approach is very different from the "frontend-allocates"
>>> approach used elsewhere in Xen.  It is very much Linux-centric,
>>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>>> KVM, and this approach is exactly the same as the one KVM implements.
>>> However, to the best of my understanding, the design described here is
>>> the only viable one.  Linux MM and GPU drivers require it, and changes
>>> to either to relax this requirement will not be accepted upstream.
>>
>> Teddy Astie (CCd) proposed a couple of alternatives on Matrix:
>>
>> 1. Create dma-bufs for guest pages and import them into the host.
>>
>>     This is a win not only for Xen, but also for KVM.  Right now, shared
>>     (CPU) memory buffers must be copied from the guest to the host,
>>     which is pointless.  So fixing that is a good thing!  That said,
>>     I'm still concerned about triggering GPU driver code-paths that
>>     are not tested on bare metal.
>>     
>> 2. Use PASID and 2-stage translation so that the GPU can operate in
>>     guest physical memory.
>>     
>>     This is also a win.  AMD XDNA absolutely requires PASID support,
>>     and apparently AMD GPUs can also use PASID.  So being able to use
>>     PASID is certainly helpful.
>>
>> However, I don't think either approach is sufficient for two reasons.
>>
>> First, discrete GPUs have dedicated VRAM, which Xen knows nothing about.
>> Only dom0's GPU drivers can manage VRAM, and they will insist on being
>> able to migrate it between the CPU and the GPU.  Furthermore, VRAM
>> can only be allocated using GPU driver ioctls, which will allocate
>> it from dom0-owned memory.
>>
>> Second, Certain Wayland protocols, such as screencapture, require programs
>> to be able to import dmabufs.  Both of the above solutions would
>> require that the pages be pinned.  I don't think this is an option,
>> as IIUC pin_user_pages() fails on mappings of these dmabufs.  It's why
>> direct I/O to dmabufs doesn't work.
>>
> 
> I suppose it fails because of the RAM/VRAM constraint you said 
> previously. If the location of the memory stays the same (i.e guest 
> memory mapping), pin should be almost "no-op".

Yup, there is no reason it shouldn't work for mappings of guest memory,
udmabufs or indeed for iGPU dmabufs.  Whether it does work is another
question.  I believe it sometimes fails even when it could work,
due to (fixable) Linux kernel limitations.

> (though, having dma-buf buffers coming from GPU drivers failing to pin 
> is probably not a good thing in term of stability; some stuff like 
> cameras probably break as a result; but I'm not a expert on that subject)

I suspect that it works for the drivers where this situation applies
in practice.  To the best of my knowledge, these drivers are either
for iGPUs or for compute workloads.  Compute workloads *do* need tight
control of whether a workload is in system RAM or VRAM, and typical
desktops don't have a bunch of them sitting idle so the benefits of
paging out their VRAM are vastly reduced.

>> To the best of my knowledge, these problems mean that lending memory
>> is the only way to get robust GPU acceleration for both graphics and
>> compute workloads under Xen.  Simpler approaches might work for pure
>> compute workloads, for iGPUs, or for drivers that have Xen-specific
>> changes.  None of them, however, support graphics workloads on dGPUs
>> while using the GPU driver the same way bare metal workloads do.
>>
>> Linux's graphics stack is massive, and trying to adapt it to work with
>> Xen isn't going to be sustainable in the long term.  Adapting Xen to
>> fit the graphics stack is probably more work up front, but it has the
>> advantage of working with all GPU drivers, including ones that have not
>> been written yet.  It also means that the testing done on bare metal is
>> still applicable, and that bugs found when using this driver can either
>> be reproduced on bare metal or can be fixed without driver changes.
> 
> One of my main concerns was about whether dma-buf can be used as 
> "general purpose" GPU buffers; what I read in driver code suggest it 
> should be fine, but it's a bit on the edge.

Importing dmabufs into GPU drivers should work unless there is a
reason it cannot work.  If it doesn't, I would consider it to be a bug.
However, "should work" and "is widely tested" are two different things.
I don't want something to regress and leave whoever is using this
(probably the Qubes team) responsible for fixing it.

Is that what you meant by "it's a bit on the edge"?

>> Finally, I'm not actually attached to memory lending at all.  It's a
>> lot of complexity, and it's not at all similar to how the rest of
>> Xen works.  If someone else can come up with a better solution that
>> doesn't require GPU driver changes, I'd be all for it.  Unfortunately,
>> I suspect none exists.  One can make almost anything work if one is
>> willing to patch the drivers, but I am virtually certain that this
>> will not be long-term sustainable.
>>
> 
> There's also the virtio-gpu side to consider. Blob mechanism appears to 
> insist that GPU memory to come from the host by allowing buffers that 
> aren't bound to virtio-gpu BAR yet (that also complexifies the KVM 
> situation).

Indeed, that's not a good situation for anyone and ought to just
be fixed.  CC'ing Alyssa Ross as this probably would help Spectrum too.

> You can have GPU memory that exists in virtio-gpu, without being 
> guest-visible, then the guest can map it on its own BAR.

Would you mind explaining this?

One other question: How difficult would it be to implement memory
lending for someone experienced in Linux, Xen, and their interface?
What about for someone who is not experienced?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)



* Re: Why memory lending is needed for GPU acceleration
  2026-03-29 17:32 ` Why memory lending is needed for GPU acceleration Demi Marie Obenour
  2026-03-30 10:15   ` Teddy Astie
@ 2026-03-30 20:07   ` Val Packett
  2026-03-31  9:42     ` Teddy Astie
  1 sibling, 1 reply; 14+ messages in thread
From: Val Packett @ 2026-03-30 20:07 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion, dri-devel, linux-mm,
	Jan Beulich, Ariadne Conill, Andrew Cooper, Juergen Gross,
	Teddy Astie

Hi,

On 3/29/26 2:32 PM, Demi Marie Obenour wrote:
> On 3/24/26 10:17, Demi Marie Obenour wrote:
>> Here is a proposed design document for supporting mapping GPU VRAM
>> and/or file-backed memory into other domains.  It's not in the form of
>> a patch because the leading + characters would just make it harder to
>> read for no particular gain, and because this is still RFC right now.
>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>> you can consider this to be
>>
>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> This approach is very different from the "frontend-allocates"
>> approach used elsewhere in Xen.  It is very much Linux-centric,
>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>> KVM, and this approach is exactly the same as the one KVM implements.
>> However, to the best of my understanding, the design described here is
>> the only viable one.  Linux MM and GPU drivers require it, and changes
>> to either to relax this requirement will not be accepted upstream.
> Teddy Astie (CCd) proposed a couple of alternatives on Matrix:
>
> 1. Create dma-bufs for guest pages and import them into the host.
>
>     This is a win not only for Xen, but also for KVM.  Right now, shared
>     (CPU) memory buffers must be copied from the guest to the host,
>     which is pointless.  So fixing that is a good thing!  That said,
>     I'm still concerned about triggering GPU driver code-paths that
>     are not tested on bare metal.

To expand on this: the reason cross-domain Wayland proxies have been 
doing this SHM copy dance was a deficiency in Linux UAPI. Basically, 
applications allocate shared memory using local mechanisms like memfd 
(and good old unlink-of-regular-file, ugh) which weren't compatible with 
cross-VM sharing. However udmabuf should basically solve it, at least 
for memfds. (I haven't investigated what happens with "unlinked
regular files" yet, but I don't expect anything good there, welp.)
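
For readers unfamiliar with udmabuf, the memfd-to-dmabuf step can be
sketched roughly as below. This is a minimal user-space sketch, not code
from any of the proxies mentioned here. The helper names
make_sealed_memfd() and export_as_udmabuf() are made up for
illustration, and it assumes a kernel built with CONFIG_UDMABUF and an
accessible /dev/udmabuf. As far as I know, the driver wants the memfd
sealed against shrinking and a page-aligned offset and size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/udmabuf.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create a memfd of `size` bytes (a multiple of the
 * page size) that can no longer shrink.  Returns the fd, or -1 on error. */
int make_sealed_memfd(size_t size)
{
    int fd;

    assert(size > 0);
    fd = memfd_create("shm-buffer", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)size) != 0 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: wrap the first `size` bytes of the memfd in a
 * dmabuf.  Returns the dmabuf fd, or -1 if /dev/udmabuf is unavailable
 * or the ioctl fails. */
int export_as_udmabuf(int memfd, size_t size)
{
    struct udmabuf_create req;
    int dev, dmabuf;

    dev = open("/dev/udmabuf", O_RDWR | O_CLOEXEC);
    if (dev < 0)
        return -1;

    memset(&req, 0, sizeof(req));
    req.memfd = (__u32)memfd;
    req.flags = UDMABUF_FLAGS_CLOEXEC;
    req.offset = 0;          /* must be page-aligned */
    req.size = size;         /* must be page-aligned */

    dmabuf = ioctl(dev, UDMABUF_CREATE, &req);
    close(dev);
    return dmabuf < 0 ? -1 : dmabuf;
}
```

The returned fd is an ordinary dmabuf, so it can be handed to anything
that imports dmabufs. The memfd stays usable as normal shared memory.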

But I have landed a patch in Linux that removes a silly restriction that 
tied dmabuf import into virtgpu to KMS-only mode:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=df4dc947c46bb9f80038f52c6e38cb2d40c10e50

And I have experimented with it and got a KVM-based VMM to successfully 
access and print guest memfd contents that were passed to the host via 
this mechanism. (Time to actually properly implement it into the full 
system..)

> 2. Use PASID and 2-stage translation so that the GPU can operate in
>     guest physical memory.
>     
>     This is also a win.  AMD XDNA absolutely requires PASID support,
>     and apparently AMD GPUs can also use PASID.  So being able to use
>     PASID is certainly helpful.
>
> However, I don't think either approach is sufficient for two reasons.
>
> First, discrete GPUs have dedicated VRAM, which Xen knows nothing about.
> Only dom0's GPU drivers can manage VRAM, and they will insist on being
> able to migrate it between the CPU and the GPU.  Furthermore, VRAM
> can only be allocated using GPU driver ioctls, which will allocate
> it from dom0-owned memory.
>
> Second, Certain Wayland protocols, such as screencapture, require programs
> to be able to import dmabufs.  Both of the above solutions would
> require that the pages be pinned.  I don't think this is an option,
> as IIUC pin_user_pages() fails on mappings of these dmabufs.  It's why
> direct I/O to dmabufs doesn't work.
>
> To the best of my knowledge, these problems mean that lending memory
> is the only way to get robust GPU acceleration for both graphics and
> compute workloads under Xen.  Simpler approaches might work for pure
> compute workloads, for iGPUs, or for drivers that have Xen-specific
> changes.  None of them, however, support graphics workloads on dGPUs
> while using the GPU driver the same way bare metal workloads do.
> […]
To recap, how virtio-gpu Host3d memory currently works with KVM is:

- the VMM/virtgpu receives a dmabuf over a socket 
(Wayland/D-Bus/whatever) and registers it internally with some resource 
ID that's passed to the guest;
- When the guest imports that resource, it calls 
VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB to get a PRIME buffer that can be 
turned into a dmabuf fd;
- the VMM's handler for VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB (referencing 
libkrun here) literally just calls mmap() on the host dmabuf, using the 
MAP_FIXED flag to place it correctly inside of the VMM process's 
guest-exposed VA region (configured via KVM_SET_USER_MEMORY_REGION);
- so any resource imported by the guest, even before guest userspace 
does mmap(), is mapped (as VM_PFNMAP|VM_IO) until the guest releases it.

So the generic kernel MM is out of the way, these mappings can't be 
paged out to swap etc. But accessing them may fault, as the comment for 
drm_gem_mmap_obj says:

  * Depending on their requirements, GEM objects can either
  * provide a fault handler in their vm_ops (in which case any accesses to
  * the object will be trapped, to perform migration, GTT binding, surface
  * register allocation, or performance monitoring), or mmap the buffer memory
  * synchronously after calling drm_gem_mmap_obj

It all "just works" in KVM because KVM's resolution of the guest's 
memory accesses tries to be literally equivalent to what's mapped into 
the userspace VMM process: hva_to_pfn_remapped explicitly calls 
fixup_user_fault and eventually gets to the GPU driver's fault handler.

Now for Xen this would be… painful,

but,

we have no need to replicate what KVM does. That's far from the only 
thing that can be done with a dmabuf.

The import-export machinery, on the other hand, actually does pin the
buffers at the driver level; importers are not obligated to support
movable buffers (move_notify in dma_buf_attach_ops is entirely optional).

Interestingly, there is already XEN_GNTDEV_DMABUF…

Wait, do we even have any reason at all to suspect 
that XEN_GNTDEV_DMABUF doesn't already satisfy all of our buffer-sharing 
requirements?


Thanks,
~val

P.S. while I have everyone's attention, can I get some eyes on:
https://lore.kernel.org/all/20251126062124.117425-1-val@invisiblethingslab.com/ 
?




* Re: Why memory lending is needed for GPU acceleration
  2026-03-30 20:07   ` Val Packett
@ 2026-03-31  9:42     ` Teddy Astie
  2026-03-31 11:23       ` Val Packett
  2026-04-03 21:24       ` Marek Marczykowski-Górecki
  0 siblings, 2 replies; 14+ messages in thread
From: Teddy Astie @ 2026-03-31  9:42 UTC (permalink / raw)
  To: Val Packett, Demi Marie Obenour, Xen developer discussion,
	dri-devel, linux-mm, Ariadne Conill

On 30/03/2026 at 22:13, Val Packett wrote:
> Hi,
>
> On 3/29/26 2:32 PM, Demi Marie Obenour wrote:
>> On 3/24/26 10:17, Demi Marie Obenour wrote:
>>> Here is a proposed design document for supporting mapping GPU VRAM
>>> and/or file-backed memory into other domains.  It's not in the form of
>>> a patch because the leading + characters would just make it harder to
>>> read for no particular gain, and because this is still RFC right now.
>>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>>> you can consider this to be
>>>
>>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>>
>>> This approach is very different from the "frontend-allocates"
>>> approach used elsewhere in Xen.  It is very much Linux-centric,
>>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>>> KVM, and this approach is exactly the same as the one KVM implements.
>>> However, to the best of my understanding, the design described here is
>>> the only viable one.  Linux MM and GPU drivers require it, and changes
>>> to either to relax this requirement will not be accepted upstream.
>> Teddy Astie (CCd) proposed a couple of alternatives on Matrix:
>>
>> 1. Create dma-bufs for guest pages and import them into the host.
>>
>>     This is a win not only for Xen, but also for KVM.  Right now, shared
>>     (CPU) memory buffers must be copied from the guest to the host,
>>     which is pointless.  So fixing that is a good thing!  That said,
>>     I'm still concerned about triggering GPU driver code-paths that
>>     are not tested on bare metal.
>
> To expand on this: the reason cross-domain Wayland proxies have been
> doing this SHM copy dance was a deficiency in Linux UAPI. Basically,
> applications allocate shared memory using local mechanisms like memfd
> (and good old unlink-of-regular-file, ugh) which weren't compatible with
> cross-VM sharing. However udmabuf should basically solve it, at least
> for memfds. (I haven't yet investigated what happens with "unlinked
> regular files" yet but I don't expect anything good there, welp.)
>
> But I have landed a patch in Linux that removes a silly restriction that
> tied dmabuf import into virtgpu to KMS-only mode:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=df4dc947c46bb9f80038f52c6e38cb2d40c10e50
>
> And I have experimented with it and got a KVM-based VMM to successfully
> access and print guest memfd contents that were passed to the host via
> this mechanism. (Time to actually properly implement it into the full
> system..)
>
>> 2. Use PASID and 2-stage translation so that the GPU can operate in
>>     guest physical memory.
>>     This is also a win.  AMD XDNA absolutely requires PASID support,
>>     and apparently AMD GPUs can also use PASID.  So being able to use
>>     PASID is certainly helpful.
>>
>> However, I don't think either approach is sufficient for two reasons.
>>
>> First, discrete GPUs have dedicated VRAM, which Xen knows nothing about.
>> Only dom0's GPU drivers can manage VRAM, and they will insist on being
>> able to migrate it between the CPU and the GPU.  Furthermore, VRAM
>> can only be allocated using GPU driver ioctls, which will allocate
>> it from dom0-owned memory.
>>
>> Second, Certain Wayland protocols, such as screencapture, require programs
>> to be able to import dmabufs.  Both of the above solutions would
>> require that the pages be pinned.  I don't think this is an option,
>> as IIUC pin_user_pages() fails on mappings of these dmabufs.  It's why
>> direct I/O to dmabufs doesn't work.
>>
>> To the best of my knowledge, these problems mean that lending memory
>> is the only way to get robust GPU acceleration for both graphics and
>> compute workloads under Xen.  Simpler approaches might work for pure
>> compute workloads, for iGPUs, or for drivers that have Xen-specific
>> changes.  None of them, however, support graphics workloads on dGPUs
>> while using the GPU driver the same way bare metal workloads do.
>> […]
> To recap, how virtio-gpu Host3d memory currently works with KVM is:
>
> - the VMM/virtgpu receives a dmabuf over a socket
> (Wayland/D-Bus/whatever) and registers it internally with some resource
> ID that's passed to the guest;
> - When the guest imports that resource, it calls
> VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB to get a PRIME buffer that can be
> turned into a dmabuf fd;
> - the VMM's handler for VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB (referencing
> libkrun here) literally just calls mmap() on the host dmabuf, using the
> MAP_FIXED flag to place it correctly inside of the VMM process's
> guest-exposed VA region (configured via KVM_SET_USER_MEMORY_REGION);
> - so any resource imported by the guest, even before guest userspace
> does mmap(), is mapped (as VM_PFNMAP|VM_IO) until the guest releases it.
>
> So the generic kernel MM is out of the way, these mappings can't be
> paged out to swap etc. But accessing them may fault, as the comment for
> drm_gem_mmap_obj says:
>
>   * Depending on their requirements, GEM objects can either
>   * provide a fault handler in their vm_ops (in which case any accesses to
>   * the object will be trapped, to perform migration, GTT binding, surface
>   * register allocation, or performance monitoring), or mmap the buffer memory
>   * synchronously after calling drm_gem_mmap_obj
>
> It all "just works" in KVM because KVM's resolution of the guest's
> memory accesses tries to be literally equivalent to what's mapped into
> the userspace VMM process: hva_to_pfn_remapped explicitly calls
> fixup_user_fault and eventually gets to the GPU driver's fault handler.
>
> Now for Xen this would be… painful,
>

indeed

> but,
>
> we have no need to replicate what KVM does. That's far from the only
> thing that can be done with a dmabuf.
>
> The import-export machinery on the other hand actually does pin the
> buffers on the driver level, importers are not obligated to support
> movable buffers (move_notify in dma_buf_attach_ops is entirely optional).
>

A dma-buf is conceptually non-movable while it is actively in use
(otherwise moving it would break DMA). It's just a foreign buffer and,
from the device's standpoint, plain RAM that needs to be mapped.

> Interestingly, there is already XEN_GNTDEV_DMABUF…
>
> Wait, do we even have any reason at all to suspect
> that XEN_GNTDEV_DMABUF doesn't already satisfy all of our buffer-sharing
> requirements?
>

XEN_GNTDEV_DMABUF was designed for GPU use cases, more precisely for
paravirtualizing a display. The only issue I would have with it is that
grants do not scale to GPU 3D use cases (with hundreds of MB to share).
But we can still keep the concept of structured guest-owned memory
that is shared with Dom0 (just in larger quantities). I have some ideas
for improving that area in Xen.

The only issue with changing the memory sharing model is that you would
need to adjust the virtio-gpu aspect, but the rest can stay the same.

The biggest concerns regarding driver compatibility are:
- Can dma-bufs be used as general-purpose buffers? Probably yes (even
with OpenGL/Vulkan). An exception may be the proprietary Nvidia drivers,
which lack the feature, and very old hardware may struggle more with it.
- Can the guest UMD work without access to VRAM? Yes (apparently).
AMDGPU has a special case where VRAM is not visible (e.g. the PCI BAR is
too small): it distinguishes the VRAM size from the "VRAM visible size"
(which could be 0). Guest-visible VRAM could then be replaced by RAM
mapped on the device.
- Can it be expressed in Vulkan terms (from the driver)? You can have
device_local memory that is not host-visible (i.e. the memory exists,
but can't be added in the guest). You would probably just lose some
zero-copy paths with VRAM, though you still have RAM shared with the GPU
(GTT in AMDGPU) if that matters.

Worth noting that if you're on integrated graphics, you don't have VRAM
and everything is RAM anyway.

>
> Thanks,
> ~val
>
> P.S. while I have everyone's attention, can I get some eyes on:
> https://lore.kernel.org/all/20251126062124.117425-1-val@invisiblethingslab.com/ ?
>
>



--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





* Re: Why memory lending is needed for GPU acceleration
  2026-03-31  9:42     ` Teddy Astie
@ 2026-03-31 11:23       ` Val Packett
  2026-04-03 21:24       ` Marek Marczykowski-Górecki
  1 sibling, 0 replies; 14+ messages in thread
From: Val Packett @ 2026-03-31 11:23 UTC (permalink / raw)
  To: Teddy Astie, Demi Marie Obenour, Xen developer discussion,
	dri-devel, linux-mm, Ariadne Conill

On 3/31/26 6:42 AM, Teddy Astie wrote:
> On 30/03/2026 at 22:13, Val Packett wrote:
>> [..]
>>
>> we have no need to replicate what KVM does. That's far from the only
>> thing that can be done with a dmabuf.
>>
>> The import-export machinery on the other hand actually does pin the
>> buffers on the driver level, importers are not obligated to support
>> movable buffers (move_notify in dma_buf_attach_ops is entirely optional).
>>
> dma-buf is by concept non-movable if actively used (otherwise, it would
> break DMA). It's just a foreign buffer, and from device standpoint, just
> plain RAM that needs to be mapped.
>
>> Interestingly, there is already XEN_GNTDEV_DMABUF…
>>
>> Wait, do we even have any reason at all to suspect
>> that XEN_GNTDEV_DMABUF doesn't already satisfy all of our buffer-sharing
>> requirements?
>>
> XEN_GNTDEV_DMABUF has been designed for GPU use-cases, and more
> precisely for paravirtualizing a display. The only issue I would have
> with it is that grants are not scalable for GPU 3D use cases (with
> hundreds of MB to share).

At least for the Qubes side, we aren't aiming at running Crysis on a 
paravirtualized GPU just yet anyway :) First we just want desktop apps 
to run well.

Keep in mind that with virtgpu paravirtualization, actual buffer sharing 
between domains only happens for CPU access, which is mostly used for:

- initial resource uploads;
- the occasional readback (which is inherently slow and all graphics 
devs try not to *ever* do);
- special cases like screen capture.

Most CPU mappings of GPU driver managed buffers live for the duration of 
a single memcpy. Mapping size can get large for games indeed, but for 
desktop applications it's rather small.

On the rendering hot path the guest virtgpu driver just submits jobs 
that refer to abstract handles managed by virglrenderer on the host, and 
buffer sharing is *not* happening.

> But we can still keep the concept of a structured guest-owned memory
> that is shared with Dom0 (but for larger quantities), I have some ideas
> regarding improving that area in Xen.
>
> The only issue with changing the memory sharing model is that you would
> need to adjust the virtio-gpu aspect, but the rest can stay the same.
>
> The biggest concern regarding driver compatibility is more about :
> - can dma-buf be used as general buffers : probably yes (even with
> OpenGL/Vulkan); exception may be proprietary Nvidia drivers that lacks
> the feature; maybe very old hardware may struggle more with it
Current nvidia blob drivers do not lack the feature btw..
> - can guest UMD work without access to vram : yes (apparently), AMDGPU
> has a special case where VRAM is not visible (e.g too small PCI BAR),
> there is vram size vs "vram visible size" (which could be 0); you could
> fallback vram-guest-visible with ram mapped on device

UMDs work on a higher level, they work on buffers which are managed by 
the KMD.

In any paravirtualization situation (whether "native 
contexts"/vDRM which runs the full HW-specific UMD in the guest, or 
API-forwarding solutions like Venus) the only guest KMD is virtio-gpu! 
The guest kernel isn't really aware of what VRAM even is.

https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/amd/common/virtio/amdgpu_virtio_bo.c

^ this 300-ish-line file is everything amdgpu ever does with buffer 
objects on the virtio backend.

All it can do is manage host handles, import guest dmabufs into virtgpu 
to get handles for them, export handles to get guest dmabufs, and map 
handles for guest CPU access via the VIRTGPU_MAP ioctl. There are no 
special details to any of this, it's all very straightforward.

It seems to me that implementing VIRTGPU_MAP in terms of dmabuf grants 
would be easy!

I'll need to get to that point first though, right now I'm still working 
on making basic virtio itself work in our (x86) situation.

> - can it be defined in Vulkan terms (from driver) : You can have
> device_local memory without having it host-visible (i.e memory exists,
> but can't be added in the guest). You would probably just lose some
> zero-copy paths with VRAM. Though you still have RAM shared with GPU
> (GTT in AMDGPU) if that matters.

What did you mean by "added" in the guest?

We shouldn't ever have to touch this level at all, anyhow…

> Worth noting that if you're on integration graphics, you don't have VRAM
> and everything is RAM anyway.

Thanks,
~val


* Re: Why memory lending is needed for GPU acceleration
  2026-03-31  9:42     ` Teddy Astie
  2026-03-31 11:23       ` Val Packett
@ 2026-04-03 21:24       ` Marek Marczykowski-Górecki
  1 sibling, 0 replies; 14+ messages in thread
From: Marek Marczykowski-Górecki @ 2026-04-03 21:24 UTC (permalink / raw)
  To: Teddy Astie
  Cc: Val Packett, Demi Marie Obenour, Xen developer discussion,
	dri-devel, linux-mm, Ariadne Conill


On Tue, Mar 31, 2026 at 09:42:22AM +0000, Teddy Astie wrote:
> XEN_GNTDEV_DMABUF has been designed for GPU use-cases, and more
> precisely for paravirtualizing a display. The only issue I would have
> with it is that grants are not scalable for GPU 3D use cases (with
> hundreds of MB to share).

FWIW we do use grants for graphics buffers already - window composition
buffers specifically. We do run xen with extra options for that:
gnttab_max_frames=2048 gnttab_max_maptrack_frames=4096
And similarly, on domU side:
echo 1073741824 > /sys/module/xen_gntalloc/parameters/limit
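
As a rough sanity check on what `gnttab_max_frames=2048` allows, here is a back-of-the-envelope sketch. It assumes 4 KiB pages and the grant entry sizes from Xen's public grant table header (8 bytes for v1, 16 bytes for v2); `max_granted_bytes` is a made-up helper name, and the result is only an upper bound because a few entries are reserved.

```python
PAGE_SIZE = 4096        # bytes per grant table frame and per granted page (assumed)
GRANT_ENTRY_V1 = 8      # assumed size of a v1 grant entry, in bytes
GRANT_ENTRY_V2 = 16     # assumed size of a v2 grant entry, in bytes
GIB = 1 << 30


def max_granted_bytes(gnttab_max_frames: int, entry_size: int) -> int:
    """Upper bound on the memory one domain can have granted at once."""
    entries_per_frame = PAGE_SIZE // entry_size
    # Each grant entry covers exactly one page.
    return gnttab_max_frames * entries_per_frame * PAGE_SIZE


v1 = max_granted_bytes(2048, GRANT_ENTRY_V1)
v2 = max_granted_bytes(2048, GRANT_ENTRY_V2)
print(f"v1: {v1 // GIB} GiB, v2: {v2 // GIB} GiB")  # prints "v1: 4 GiB, v2: 2 GiB"
```

So under these assumptions the grant table itself is not the limit for a few hundred MB of buffers; the cost is in the per-page grant and map operations.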

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: Mapping non-pinned memory from one Xen domain into another
  2026-03-24 14:17 Mapping non-pinned memory from one Xen domain into another Demi Marie Obenour
  2026-03-24 18:00 ` Teddy Astie
  2026-03-29 17:32 ` Why memory lending is needed for GPU acceleration Demi Marie Obenour
@ 2026-03-30 12:13 ` Teddy Astie
  2 siblings, 0 replies; 14+ messages in thread
From: Teddy Astie @ 2026-03-30 12:13 UTC (permalink / raw)
  To: Demi Marie Obenour, dri-devel, linux-mm, Val Packett,
	Ariadne Conill
  Cc: Xen developer discussion

(back to the original problem)

On 24/03/2026 at 15:17, Demi Marie Obenour wrote:
> Here is a proposed design document for supporting mapping GPU VRAM
> and/or file-backed memory into other domains.  It's not in the form of
> a patch because the leading + characters would just make it harder to
> read for no particular gain, and because this is still RFC right now.
> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
> you can consider this to be
>
> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>
> This approach is very different from the "frontend-allocates"
> approach used elsewhere in Xen.  It is very much Linux-centric,
> rather than Xen-centric.  In fact, MMU notifiers were invented for
> KVM, and this approach is exactly the same as the one KVM implements.
> However, to the best of my understanding, the design described here is
> the only viable one.  Linux MM and GPU drivers require it, and changes
> to either to relax this requirement will not be accepted upstream.
> ---
> # Memory lending: Mapping pageable memory, such as GPU VRAM, from one Xen domain into another
>

(...)

> ## Informing drivers that they must stop using memory: MMU notifiers
>
> Kernel drivers, such as xen_privcmd, in the same domain that has
> the GPU (the "host") may map GPU memory buffers.  However, they must
> register an *MMU notifier*.  This is a callback that Linux core memory
> management code ("MM") uses to tell the driver that it must stop
> all accesses to the memory.  Once the memory is no longer accessed,
> Linux assumes it can do whatever it wants with this memory:
>
> - The GPU driver can move it from VRAM to system RAM or vice versa,
>    move it within VRAM or system RAM, or make it temporarily inaccessible
>    so that other VRAM can be accessed.
> - MM can swap the page out to disk/zram/etc.
> - MM can move the page in system RAM to create huge pages.
> - MM can write the pages out to their backing files and then free them.
> - Anything else in Linux can do whatever it wants with the memory.
>
> Suspending access to memory is not allowed to block indefinitely.
> It can sleep, but it must finish in finite time regardless of what
> userspace (or other VMs) do.  Otherwise, bad things (which I believe
> includes deadlocks) may result.  I believe it can fail temporarily,
> but permanent failure is also not allowed.  Once the MMU notifier
> has succeeded, userspace or other domains **must not be allowed to
> access the memory**.  This would be an exploitable use-after-free
> vulnerability.
>
> Due to these requirements, MMU notifier callbacks must not require
> cooperation from other guests.  This means that they are not allowed to
> wait for memory that has been granted to another guest to no longer
> be mapped by that guest.  Therefore, MMU notifiers and the use of
> grant tables are inherently incompatible.
>
> ## Memory lending: A different approach
>
> Instead, xen_privcmd must use a different hypercall to _lend_ memory to
> another domain (the "guest").  When MM triggers the guest MMU notifier,
> xen_privcmd _tells_ Xen (via hypercall) to revoke the guest's access
> to the memory.  This hypercall _must succeed in bounded time_ even
> if the guest is malicious.
>
> Since the other guests are not aware this has happened, they will
> continue to access the memory.  This will cause p2m faults, which
> trap to Xen.  Xen normally kills the guest in this situation which is
> obviously not desired behavior.  Instead, Xen must pause the guest
> and inform the host's kernel.  xen_privcmd will have registered a
> handler for such events, so it will be informed when this happens.
>
> When xen_privcmd is told that a guest wants to access the revoked
> page, it will ask core MM to make the page available.  Once the page
> _is_ available, core MM will inform xen_privcmd, which will in turn
> provide a page to Xen that will be mapped into the guest's stage 2
> translation tables.  This page will generally be different than the
> one that was originally lent.
>
> Requesting a new page can fail.  This is usually due to rare errors,
> such as a GPU being hot-unplugged or an I/O error faulting pages
> from disk.  In these cases, the old content of the page is lost.
>
> When this happens, xen_privcmd can do one of two things:
>
> 1. It can provide a page that is filled with zeros.
> 2. It can tell Xen that it is unable to fulfill the request.
>
> Which choice it makes is under userspace control.  If userspace
> chooses the second option, Xen injects a fault into the guest.
> It is up to the guest to handle the fault correctly.
>
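
For reference, the failure handling quoted above can be sketched as follows. This is only a toy model in Python, not kernel or hypervisor code; `OnFailure`, `GuestFault`, `resolve_fault` and `lost_page` are hypothetical names that do not correspond to any existing Xen or Linux interface.

```python
from enum import Enum, auto

PAGE_SIZE = 4096  # assumed page size


class OnFailure(Enum):
    ZERO_PAGE = auto()      # option 1: hand the guest a zero-filled page
    INJECT_FAULT = auto()   # option 2: report failure, Xen faults the guest


class GuestFault(Exception):
    """Stands in for the fault Xen would inject into the guest."""


def resolve_fault(fetch_page, policy):
    """Return the page contents to map into the guest after a p2m fault.

    fetch_page() models asking core MM for the page. It raises OSError
    when the old contents are lost (hot-unplug, I/O error, ...).
    """
    try:
        return fetch_page()
    except OSError as err:
        if policy is OnFailure.ZERO_PAGE:
            return bytes(PAGE_SIZE)
        raise GuestFault("lender cannot provide the page") from err


def lost_page():
    # Hypothetical failing backend, e.g. an I/O error while faulting in.
    raise OSError("I/O error while faulting page in")


page = resolve_fault(lost_page, OnFailure.ZERO_PAGE)
print(len(page), page.count(0))
```

The policy is passed in by the caller to reflect that, in the quoted design, the choice between the two options is made by userspace rather than by the kernel.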

To me there are multiple problems here:
- mapping a host-owned page into the guest
- making such a mapping "non-persistent", i.e. letting Linux discard it
- tracking guest accesses to such "non-existent mappings" (to remap them)

All of these could be folded into a single solution, but I don't think
that's a good idea: it would mean that various kinds of MM events for
Linux could originate from Xen. There is also the "process disappeared"
situation, which could cause a lot of problems for the kernel. In KVM,
the guest's existence is tied to the process by construction, but with
Xen things are different.
But I think that, at least for the virtio-gpu use case, these can be
separated.

Here is an approach (in multiple parts):

The first two problems can be solved in a "simple" way: just make a
"reverse foreign map" with an MMU notifier attached to it. If Linux wants
to discard the mapping, the remote mapping in the guest is unmapped.
(Something still needs to be done to make that work for VRAM.)

The third one is a bit trickier. It's mostly a consequence of the second
problem, e.g. swap or RAM/VRAM migration: the page has disappeared from
the guest. That could be dealt with using a slightly modified ioreq
server which, instead of responding to reads/writes, would just act on
"accesses" (mostly to avoid having to emulate the reads/writes in the
device model).

So overall, pages are mapped but "may disappear" (at the kernel's
discretion), and the device model (e.g. QEMU) would need to remap them
explicitly if that happens and the guest needs them.
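
To make the idea more concrete, here is a minimal toy model in Python. It is only a sketch of the control flow, not an implementation: `ReverseForeignMap`, `mmu_notifier_invalidate`, `guest_access` and `device_model` are names invented for illustration, and the dictionary stands in for the guest's p2m.

```python
class ReverseForeignMap:
    """Toy model: host pages mapped into a guest, revocable by the host."""

    def __init__(self, device_model):
        self.p2m = {}                     # guest frame number -> page contents
        self.device_model = device_model  # called on access to an unmapped gfn

    def map(self, gfn, page):
        self.p2m[gfn] = page

    def mmu_notifier_invalidate(self, gfn):
        # The host kernel wants the page back: unmap it without needing
        # any cooperation from the guest.
        self.p2m.pop(gfn, None)

    def guest_access(self, gfn):
        if gfn not in self.p2m:
            # Access-only "ioreq": no read/write emulation, just a request
            # for the device model to make the page present again.
            self.device_model(self, gfn)
        return self.p2m[gfn]


backing = {0x1000: b"framebuffer"}  # stands in for swapped-out or migrated data
events = []                         # gfns the device model was asked to remap


def device_model(mapping, gfn):
    events.append(gfn)
    mapping.map(gfn, backing[gfn])  # explicit remap, as e.g. QEMU would do


m = ReverseForeignMap(device_model)
m.map(0x1000, backing[0x1000])
m.mmu_notifier_invalidate(0x1000)   # e.g. page swapped out or moved to VRAM
print(m.guest_access(0x1000), events)
```

The point of the split is that the unmap path never waits on anyone, while the remap path goes through the device model and can take as long as it needs.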

What do you think?

> ## Restrictions on lent memory
>
> Lent memory is still considered to belong to the lending domain.
> The borrowing domain can only access it via its p2m.  Hypercalls made
> by the borrowing domain act as if the borrowed memory was not present.
> This includes, but is not limited to:
>
> - Using pointers to borrowed memory in hypercall arguments.
> - Granting borrowed memory to other VMs.
> - Any other operation that depends on whether a page is accessible
>    by a domain.
>
> Furthermore:
>
> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>    the guest has attached, because IOTLB faults generally are not
>    replayable.
>
> - Foreign mapping hypercalls that reference lent memory will fail.
>    Otherwise, the domain making the foreign mapping hypercall could
>    continue to access the borrowed memory after the lease had been
>    revoked.  This is true even if the domain performing the foreign
>    mapping is an all-powerful dom0.  Otherwise, an emulated device
>    could access memory whose lease had been revoked.
>
> This also means that live migration of a domain that has borrowed
> memory requires cooperation from the lending domain.  For now, it
> will be considered out of scope.  Live migration is typically used
> with server workloads, and accelerators for server hardware often
> support SR-IOV.
>
> ## Where will lent memory appear in a guest's address space?
>
> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
> by dom0 or an alternate ioreq server.  However, it is not *required*
> to be a PCI BAR.
>
> ## Privileges required for memory lending
>
> For obvious reasons, the domain lending the memory must be privileged
> over the domain borrowing it.  The lending domain does not inherently
> need to be privileged over the whole system.  However, supporting
> situations where the providing domain is not dom0 will require
> extensions to Xen's permission model, except for the case where the
> providing domain only serves a single VM.
>
> Memory lending hypercalls are not subject to the restrictions of
> XSA-77.  They may safely be delegated to VMs other than dom0.
>
> ## Userspace API
>
> To the extent possible, the memory lending API should be similar
> to KVM's uAPI.  Ideally, userspace should be able to abstract over
> the differences.  Using the API should not require root privileges
> or be equivalent to root on the host.  It should only require a file
> descriptor that only allows controlling a single domain.
>
> ## Future directions: Creating & running Xen VMs without special privileges
>
> With the exception of a single page used for hypercalls, it is
> possible for a Xen domain to *only* have borrowed memory.  Such a
> domain can be managed by an entirely unprivileged userspace process,
> just like it would manage a KVM VM.  Since the "host" in this scenario
> only needs privilege over a domain it itself created, it is possible
> (once a subset of XSA-77 restrictions are lifted) for this domain
> to not actually be dom0.
>
> Even with XSA-77, the domain could still request dom0 to create and
> destroy the domain on its behalf.  Qubes OS already allows unprivileged
> guests to cause domain creation and destruction, so this does not
> introduce any new Xen attack surface.
>
> This could allow unprivileged processes in a domU to create and manage
> sub-domUs, just as if the domU had nested virtualization support and
> KVM was used.  However, this should provide significantly better
> performance than nested virtualization.



--
Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





Thread overview: 14+ messages
2026-03-24 14:17 Mapping non-pinned memory from one Xen domain into another Demi Marie Obenour
2026-03-24 18:00 ` Teddy Astie
2026-03-26 17:18   ` Demi Marie Obenour
2026-03-26 18:26     ` Teddy Astie
2026-03-27 17:18       ` Demi Marie Obenour
2026-03-29 17:32 ` Why memory lending is needed for GPU acceleration Demi Marie Obenour
2026-03-30 10:15   ` Teddy Astie
2026-03-30 10:25     ` Jan Beulich
2026-03-30 12:24     ` Demi Marie Obenour
2026-03-30 20:07   ` Val Packett
2026-03-31  9:42     ` Teddy Astie
2026-03-31 11:23       ` Val Packett
2026-04-03 21:24       ` Marek Marczykowski-Górecki
2026-03-30 12:13 ` Mapping non-pinned memory from one Xen domain into another Teddy Astie
