public inbox for linux-mm@kvack.org
From: Demi Marie Obenour <demiobenour@gmail.com>
To: Teddy Astie <teddy.astie@vates.tech>,
	Xen developer discussion <xen-devel@lists.xenproject.org>,
	dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
	Jan Beulich <jbeulich@suse.com>,
	Val Packett <val@invisiblethingslab.com>,
	Ariadne Conill <ariadne@ariadne.space>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Juergen Gross <jgross@suse.com>
Subject: Re: Mapping non-pinned memory from one Xen domain into another
Date: Thu, 26 Mar 2026 13:18:39 -0400	[thread overview]
Message-ID: <4f201188-31ac-4dac-9cc6-79c4283486e5@gmail.com> (raw)
In-Reply-To: <5123c11c-3b8a-4633-809f-16c24418a4ce@vates.tech>



On 3/24/26 14:00, Teddy Astie wrote:
> Hello,
> 
> I assume all this only concerns HVM/PVH DomUs; I don't think it is
> doable for PV DomUs (if that matters).
> 
> On 24/03/2026 at 15:17, Demi Marie Obenour wrote:
>> Here is a proposed design document for supporting mapping GPU VRAM
>> and/or file-backed memory into other domains.  It's not in the form of
>> a patch because the leading + characters would just make it harder to
>> read for no particular gain, and because this is still RFC right now.
>> Once it is ready to merge, I'll send a proper patch.  Nevertheless,
>> you can consider this to be
>>
>> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> This approach is very different from the "frontend-allocates"
>> approach used elsewhere in Xen.  It is very much Linux-centric,
>> rather than Xen-centric.  In fact, MMU notifiers were invented for
>> KVM, and this approach is exactly the same as the one KVM implements.
>> However, to the best of my understanding, the design described here is
>> the only viable one.  Linux MM and GPU drivers require it, and changes
>> to either to relax this requirement will not be accepted upstream.
>> ---
>> # Memory lending: Mapping pageable memory, such as GPU VRAM, from one Xen domain into another
>>
>> ## Background
>>
>> Some Linux kernel subsystems require full control over certain memory
>> regions.  This includes the ability to handle page faults from any
>> entity accessing this memory.  Such entities include not only that
>> kernel's userspace, but also kernels belonging to other guests.
>>
>> For instance, GPU drivers reserve the right to migrate data between
>> VRAM and system RAM at any time.  Furthermore, there is a set of
>> page tables between the "aperture" (mapped as a PCI BAR) and the
>> actual VRAM.  This means that the GPU driver can make the memory
>> temporarily inaccessible to the CPU.  This is in fact _required_
>> when resizable BAR is not supported, as otherwise there is too much
>> VRAM to expose it all via a single BAR.
>>
>> Since the backing storage of this memory must be movable, pinning
>> it is not supported.  However, the existing grant table interface
>> requires pinned memory.  Therefore, such memory currently cannot be
>> shared with another guest.  As a result, implementing virtio-GPU blob
>> objects is not possible.  Since blob objects are a prerequisite for
>> both Venus and native contexts, supporting Vulkan via virtio-GPU on
>> Xen is also impossible.
>>
> 
> I'm not sure Vulkan fully allows memory to be moved between RAM and
> VRAM. Or at least, you would need to lie a bit about the
> VK_MEMORY_HEAP_DEVICE_LOCAL_BIT property.
> 
> So I assume there is a way to choose between having memory as VRAM or as 
> RAM somehow, unless it is only a hint?

On Linux, it is indeed only a hint.  If VRAM is exhausted, Linux
will move buffers to system RAM to free up space.  This significantly
benefits desktop workloads, which have a lot of GPU buffers that are
not used frequently.  Games are an exception, but I suspect they use
their buffers frequently enough that Linux doesn't decide to page
them out.

>> Filesystem Direct Access (DAX) also relies on non-pinned
>> memory.  In the (now rare) case of persistent memory, it is because
>> the filesystem may need to move data blocks around on disk.  In the
>> case of virtio-pmem and virtio-fs, it is because page faults on write
>> operations are used to inform filesystems that they need to write the
>> data back at some point.  Without these page faults, filesystems will
>> not write back the data and silent data loss will result.
>>
>> There are other use-cases for this too.  For instance, virtio-GPU
>> cross-domain Wayland exposes host shared memory buffers to the guest.
>> These buffers are mmap()'d file descriptors provided by the Wayland
>> compositor, and as such are not guaranteed to be anonymous memory.
>> Using grant tables for such mappings would conflict with the design
>> of existing virtio-GPU implementations, which assume that GPU VRAM
>> and shared memory can be handled uniformly.
>>
>> Additionally, this is needed to support paging guest memory out to the
>> host's disks.  While this is significantly less efficient than using
>> an in-guest balloon driver, it has the advantage of not requiring
>> guest cooperation.  Therefore, it can be useful for situations in
>> which the performance of a guest is irrelevant, but where saving the
>> guest isn't appropriate.
>>
>> ## Informing drivers that they must stop using memory: MMU notifiers
>>
>> Kernel drivers, such as xen_privcmd, in the same domain that has
>> the GPU (the "host") may map GPU memory buffers.  However, they must
>> register an *MMU notifier*.  This is a callback that Linux core memory
>> management code ("MM") uses to tell the driver that it must stop
>> all accesses to the memory.  Once the memory is no longer accessed,
>> Linux assumes it can do whatever it wants with this memory:
>>
>> - The GPU driver can move it from VRAM to system RAM or vice versa,
>>   move it within VRAM or system RAM, or make it temporarily
>>   inaccessible so that other VRAM can be accessed.
>> - MM can swap the page out to disk/zram/etc.
>> - MM can move the page in system RAM to create huge pages.
>> - MM can write the pages out to their backing files and then free them.
>> - Anything else in Linux can do whatever it wants with the memory.
>>
>> Suspending access to memory is not allowed to block indefinitely.
>> It can sleep, but it must finish in finite time regardless of what
>> userspace (or other VMs) do.  Otherwise, bad things (which I believe
>> includes deadlocks) may result.  I believe it can fail temporarily,
>> but permanent failure is also not allowed.  Once the MMU notifier
>> has succeeded, userspace or other domains **must not be allowed to
>> access the memory**.  This would be an exploitable use-after-free
>> vulnerability.
>>
>> Due to these requirements, MMU notifier callbacks must not require
>> cooperation from other guests.  This means that they are not allowed to
>> wait for memory that has been granted to another guest to no longer
>> be mapped by that guest.  Therefore, MMU notifiers and the use of
>> grant tables are inherently incompatible.
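To make the contract above concrete, here is a minimal userspace C model of
the notifier protocol (all names are hypothetical; the real kernel interface
lives in <linux/mmu_notifier.h> and differs substantially): a driver registers
a callback, MM invokes every callback before reclaiming a page, and once the
callbacks have run, any further access through the old mapping must fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the MMU-notifier contract; NOT the kernel API. */
#define MAX_NOTIFIERS 8

/* Called by MM to tell a driver to stop using a page range.
 * Must finish in bounded time; may not wait on other guests. */
typedef void (*invalidate_cb)(unsigned long start, unsigned long end);

static invalidate_cb notifiers[MAX_NOTIFIERS];
static size_t n_notifiers;
static bool page_accessible = true;   /* state of one modeled page */

static void notifier_register(invalidate_cb cb)
{
    assert(n_notifiers < MAX_NOTIFIERS);
    notifiers[n_notifiers++] = cb;
}

/* MM side: before reclaiming or moving the page, run every notifier.
 * Once this returns, no registered user may touch the page. */
static void mm_invalidate_range(unsigned long start, unsigned long end)
{
    for (size_t i = 0; i < n_notifiers; i++)
        notifiers[i](start, end);
    page_accessible = false;
}

/* Driver side: drop its mapping; must not block indefinitely. */
static bool driver_mapped = true;
static void driver_invalidate(unsigned long start, unsigned long end)
{
    (void)start; (void)end;
    driver_mapped = false;            /* stop all access immediately */
}

/* Driver access path: only legal while the mapping is still live. */
static bool driver_access(void)
{
    return driver_mapped && page_accessible;
}
```

The key property the model captures is that after mm_invalidate_range()
returns, driver_access() can never succeed again until MM re-establishes the
mapping; anything else would be the use-after-free described above.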
> 
> 
>>
>> ## Memory lending: A different approach
>>
>> Instead, xen_privcmd must use a different hypercall to _lend_ memory to
>> another domain (the "guest").  When MM triggers the guest MMU notifier,
>> xen_privcmd _tells_ Xen (via hypercall) to revoke the guest's access
>> to the memory.  This hypercall _must succeed in bounded time_ even
>> if the guest is malicious.
>>
>> Since the other guests are not aware this has happened, they will
>> continue to access the memory.  This will cause p2m faults, which
>> trap to Xen.  Xen normally kills the guest in this situation, which is
>> obviously not desired behavior.  Instead, Xen must pause the guest
>> and inform the host's kernel.  xen_privcmd will have registered a
>> handler for such events, so it will be informed when this happens.
>>
>> When xen_privcmd is told that a guest wants to access the revoked
>> page, it will ask core MM to make the page available.  Once the page
>> _is_ available, core MM will inform xen_privcmd, which will in turn
>> provide a page to Xen that will be mapped into the guest's stage 2
>> translation tables.  This page will generally be different from the
>> one that was originally lent.
>>
>> Requesting a new page can fail.  This is usually due to rare errors,
>> such as a GPU being hot-unplugged or an I/O error faulting pages
>> from disk.  In these cases, the old content of the page is lost.
>>
>> When this happens, xen_privcmd can do one of two things:
>>
>> 1. It can provide a page that is filled with zeros.
>> 2. It can tell Xen that it is unable to fulfill the request.
>>
>> Which choice it makes is under userspace control.  If userspace
>> chooses the second option, Xen injects a fault into the guest.
>> It is up to the guest to handle the fault correctly.
>>
> Is it some ioreq-like mechanism where:
> - A guest accesses a "non-ready" page
> - Nothing there -> pagefault (e.g. NPF) and the guest vCPU is blocked
> - Xen asks Dom0 what to do (event channel, VIRQ, ...)
> - Dom0 explicitly maps memory to the guest (or does any other operation)
> - Guest resumes execution with the page mapped

This is exactly it!
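As a sketch, that flow can be modeled in a few lines of C (all names here are
invented for illustration; the real interface would be a new hypercall plus an
event-channel notification to dom0):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the revoke/repopulate flow; names are invented. */
enum vcpu_state { VCPU_RUNNING, VCPU_PAUSED, VCPU_FAULTED };

struct guest {
    enum vcpu_state state;
    bool page_present;   /* is the borrowed page in the guest's p2m? */
};

/* dom0 policy when the original contents cannot be recovered:
 * provide a zero-filled page, or let Xen inject a fault. */
enum fallback { FALLBACK_ZERO_PAGE, FALLBACK_INJECT_FAULT };

/* Host (xen_privcmd) side: the MMU notifier fired, so tell Xen to
 * revoke the guest's access.  Must succeed in bounded time. */
static void host_revoke_page(struct guest *g)
{
    g->page_present = false;   /* unmap from stage-2 translation tables */
}

/* Guest touches the page: either it works, or Xen pauses the vCPU
 * and raises a paging event for the host's kernel. */
static bool guest_access(struct guest *g)
{
    if (g->page_present)
        return true;
    g->state = VCPU_PAUSED;    /* p2m fault traps to Xen */
    return false;
}

/* dom0 handler: ask core MM for the page again.  'available' models
 * whether MM could fault it back in (false after, e.g., GPU unplug). */
static void dom0_handle_paging_event(struct guest *g, bool available,
                                     enum fallback policy)
{
    if (available || policy == FALLBACK_ZERO_PAGE) {
        g->page_present = true;    /* map a (possibly different) page */
        g->state = VCPU_RUNNING;   /* Xen resumes the vCPU */
    } else {
        g->state = VCPU_FAULTED;   /* Xen injects a fault instead */
    }
}
```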

> Something that looks a bit similar to "memory paging".

It *is* memory paging :).  I named it differently for two reasons:

1. I didn't want this to be confused with the existing memory paging
   mechanism in Xen.

2. I wanted to emphasize that lent memory is still "owned" by the
   lending domain.
   
>> ## Restrictions on lent memory
>>
>> Lent memory is still considered to belong to the lending domain.
>> The borrowing domain can only access it via its p2m.  Hypercalls made
>> by the borrowing domain act as if the borrowed memory was not present.
>> This includes, but is not limited to:
>>
>> - Using pointers to borrowed memory in hypercall arguments.
>> - Granting borrowed memory to other VMs.
>> - Any other operation that depends on whether a page is accessible
>>    by a domain.
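A sketch of the check this list implies, with hypothetical types and names
(real Xen p2m types are richer): a hypercall that reads guest memory would
treat borrowed frames exactly like holes in the p2m.

```c
#include <assert.h>

/* Hypothetical p2m entry types; real Xen has many more. */
enum p2m_type { P2M_RAM, P2M_BORROWED, P2M_INVALID };

/* Copying hypercall arguments out of a guest frame must treat
 * borrowed memory exactly like memory that is not present. */
static int copy_from_guest_frame(enum p2m_type t)
{
    switch (t) {
    case P2M_RAM:
        return 0;        /* normal guest RAM: the copy proceeds */
    case P2M_BORROWED:   /* lent memory: act as if not present */
    case P2M_INVALID:
    default:
        return -1;       /* would be -EFAULT in real code */
    }
}
```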
> 
> What about emulated instructions that refers to this memory ?

This would be allowed if (and only if) it can trigger paging as you
wrote above.

>> Furthermore:
>>
>> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>>    the guest has attached, because IOTLB faults generally are not
>>    replayable.
>>
> 
> Given that (as written below) borrowed memory is part of some form of 
> emulated BAR or special region, there is no guarantee that DMA will work 
> properly anyway (unless P2P DMA support is advertised).
> 
> Splitting the IOMMU side from the P2M is not a good idea as it rules out 
> the "IOMMU HAP PT Share" optimization.

If the pages are mapped in the IOMMU, paging them out requires an
IOTLB invalidation.  My understanding is that these are far too slow.

How important is sharing the HAP and IOMMU page tables?

>> - Foreign mapping hypercalls that reference lent memory will fail.
>>    Otherwise, the domain making the foreign mapping hypercall could
>>    continue to access the borrowed memory after the lease had been
>>    revoked.  This is true even if the domain performing the foreign
>>    mapping is an all-powerful dom0.  Otherwise, an emulated device
>>    could access memory whose lease had been revoked.
>>
>> This also means that live migration of a domain that has borrowed
>> memory requires cooperation from the lending domain.  For now, it
>> will be considered out of scope.  Live migration is typically used
>> with server workloads, and accelerators for server hardware often
>> support SR-IOV.
>>
>> ## Where will lent memory appear in a guest's address space?
>>
>> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
>> by dom0 or an alternate ioreq server.  However, it is not *required*
>> to be a PCI BAR.
>>
> 
> ---
> 
> While the design could work (albeit with the implied complexity), I'm
> not a big fan of it; or at least, it needs to consider some constraints
> to have reasonable performance.
> One of the big issues is that a performance-sensitive system (a
> virtualized GPU) is interlocking with several "hard to optimize"
> subsystems, like the P2M, or Dom0 having to process a paging event.
> 
> Modifying the P2M (especially removing entries) is a fairly expensive 
> operation as it sometimes requires pausing all the vCPUs each time it's 
> done.

Not every GPU supports recoverable page faults.  Even when they
are supported, they are extremely expensive.  Each of them involves
a round-trip from the GPU to the CPU and back, which means that a
potentially very large number of GPU cores are blocked until the
CPU can respond.  Therefore, GPU driver developers avoid relying on
GPU page faults whenever possible.  Instead, data is moved in large
chunks using a dedicated DMA engine in the GPU.
As a result, I'm not too concerned with the cost of P2M manipulation.
Anything that requires making a GPU buffer temporarily inaccessible
is already an expensive process, and driver developers have strong
incentives to keep the time the buffer is unmapped as short as
possible.
If performance turns out to be a problem, something like KVM's
asynchronous page faults might be a better solution.

> If it's done at 4k granularity, it would also lack superpage support, 
> which wouldn't help either. (Doing things at the 2M+ scale would help, 
> but I don't know enough about how the MMU notifier does things.)
> 
> While I agree that grants are not an adequate mechanism for this (for 
> multiple reasons), I'm not fully convinced by the proposal.
> I would prefer a strategy where we map a fixed amount of RAM+VRAM as a 
> blob, along with some form of cooperative hotplug mechanism to 
> dynamically provision the amount.

I asked the GPU driver developers about pinning VRAM like this a couple
years ago or so.  The response I got was that it isn't supported.
I suspect that anyone needing VRAM pinning for graphics workloads is
using non-upstreamable hacks, most likely specific to a single driver.

More generally, the entire graphics stack receives essentially no
testing under Xen.  There have been bugs that have affected Qubes OS
users for months or more, and they went unfixed because they couldn't
be reproduced outside of Xen.  To the upstream graphics developers,
Xen might as well not exist.  This means that any solution that
requires changing the graphics stack is not a practical option,
and I do not expect this to change in the foreseeable future.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

