From: Demi Marie Obenour <demiobenour@gmail.com>
To: "Teddy Astie" <teddy.astie@vates.tech>,
"Xen developer discussion" <xen-devel@lists.xenproject.org>,
dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
"Jan Beulich" <jbeulich@suse.com>,
"Val Packett" <val@invisiblethingslab.com>,
"Ariadne Conill" <ariadne@ariadne.space>,
"Andrew Cooper" <andrew.cooper3@citrix.com>,
"Juergen Gross" <jgross@suse.com>,
"Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Subject: Re: Mapping non-pinned memory from one Xen domain into another
Date: Fri, 27 Mar 2026 13:18:27 -0400 [thread overview]
Message-ID: <d96c3e5d-2302-43cc-9eee-926ff4954678@gmail.com> (raw)
In-Reply-To: <df6194cf-1fc4-4a20-ad46-6eeab1d920a2@vates.tech>
On 3/26/26 14:26, Teddy Astie wrote:
> Le 26/03/2026 à 18:18, Demi Marie Obenour a écrit :
>> On 3/24/26 14:00, Teddy Astie wrote:
>>>> ## Restrictions on lent memory
>>>> Lent memory is still considered to belong to the lending domain.
>>>> The borrowing domain can only access it via its p2m. Hypercalls made
>>>> by the borrowing domain act as if the borrowed memory was not present.
>>>> This includes, but is not limited to:
>>>>
>>>> - Using pointers to borrowed memory in hypercall arguments.
>>>> - Granting borrowed memory to other VMs.
>>>> - Any other operation that depends on whether a page is accessible
>>>> by a domain.
>>>
>>> What about emulated instructions that refer to this memory?
>>
>> This would be allowed if (and only if) it can trigger paging as you
>> wrote above.
>>
>>>> Furthermore:
>>>>
>>>> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>>>> the guest has attached, because IOTLB faults generally are not
>>>> replayable.
>>>>
>>>
>>> Given that (as written below) borrowed memory is part of some form of
>>> emulated BAR or special region, there is no guarantee that DMA will work
>>> properly anyway (unless P2P DMA support is advertised).
>>>
>>> Splitting the IOMMU side from the P2M is not a good idea as it rules out
>>> the "IOMMU HAP PT Share" optimization.
>>
>> If the pages are mapped in the IOMMU, paging them out requires an
>> IOTLB invalidation. My understanding is that these are far too slow.
>>
>
> Yes (aside from specific cases like a paravirtualized IOMMU), but only
> if you have a device in the guest.
>
> The problem is that this would force us to modify the ABI to have
> "non-DMA-able" memory in the guest, which doesn't exist yet aside from
> specific cases like grants in PV.
This would make the mechanism *de facto* incompatible with PCI
passthrough. That is unfortunate but not a dealbreaker for most
applications. It's quite annoying, though, because of dual-GPU setups
where one GPU is paravirtualized and the other is passed through.
I don't think it necessarily needs any new guest ABI changes.
As you pointed out, guests are not allowed to assume that P2PDMA
works, so if the guest tries to DMA to these pages it's a guest bug.
This means that whether the pages can be DMA'd to or not is not a
guest-facing ABI.
That said, this should not block getting this feature implemented.
>> How important is sharing the HAP and IOMMU page tables?
>>
>>>> - Foreign mapping hypercalls that reference lent memory will fail.
>>>> Otherwise, the domain making the foreign mapping hypercall could
>>>> continue to access the borrowed memory after the lease had been
>>>> revoked. This is true even if the domain performing the foreign
>>>> mapping is an all-powerful dom0. Otherwise, an emulated device
>>>> could access memory whose lease had been revoked.
>>>>
>>>> This also means that live migration of a domain that has borrowed
>>>> memory requires cooperation from the lending domain. For now, it
>>>> will be considered out of scope. Live migration is typically used
>>>> with server workloads, and accelerators for server hardware often
>>>> support SR-IOV.
>>>>
>>>> ## Where will lent memory appear in a guest's address space?
>>>>
>>>> Typically, lent memory will be an emulated PCI BAR. It may be emulated
>>>> by dom0 or an alternate ioreq server. However, it is not *required*
>>>> to be a PCI BAR.
>>>>
>>>
>>> ---
>>>
>>> While the design could work (despite the implied complexity), I'm not a
>>> big fan of it, or at least, it needs to consider some constraints to
>>> have reasonable performance.
>>> One of the big issues is that a performance-sensitive system (a
>>> virtualized GPU) is interlocking with several hard-to-optimize
>>> subsystems, like the P2M, or dom0 having to process a paging event.
>>>
>>> Modifying the P2M (especially removing entries) is a fairly expensive
>>> operation as it sometimes requires pausing all the vCPUs each time it's
>>> done.
>>
>> Not every GPU supports recoverable page faults. Even when they
>> are supported, they are extremely expensive. Each of them involves
>> a round-trip from the GPU to the CPU and back, which means that a
>> potentially very large number of GPU cores are blocked until the
>> CPU can respond. Therefore, GPU driver developers avoid relying on
>> GPU page faults whenever possible. Instead, data is moved in large
>> chunks using a dedicated DMA engine in the GPU.
>> As a result, I'm not too concerned with the cost of P2M manipulation.
>> Anything that requires making a GPU buffer temporarily inaccessible
>> is already an expensive process, and driver developers have strong
>> incentives to keep the time the buffer is unmapped as short as
>> possible.
>> If performance turns out to be a problem, something like KVM's
>> asynchronous page faults might be a better solution.
>>
>
> Asynchronous page faults look interesting and potentially easier to
> implement.
>
> IIUC, the idea is to make the pages disappear on the guest's behalf, and
> the guest would have to deal with the eventual page fault. Currently in
> Xen, an unhandled #NPF is fatal, but that could be relaxed for specific
> regions and transformed into a #PF or another exception for the guest
> to handle.
Yup!
> We actually have a similar need for SEV-ES MMIO handling, as we need to
> distinguish "MMIO-related NPF" (to paravirtualize through the GHCB) from
> other NPFs; this needs to be configured in advance in the page tables
> (so that the CPU chooses between #VC and a VMEXIT on #NPF).
>
> It would also need some form of paravirtualization, coming from virtio
> or a new Xen PV driver, for the guest to be made aware of this
> mechanism. I also assume the guest handles that kind of event properly.
On KVM, asynchronous page faults are purely an optimization. I have
a few concerns with relying entirely on them:
1. Can guest userspace use this to crash the guest kernel? What
happens if the guest kernel takes a fault in copy_{to,from}_user()?
2. Can this be made to work with Windows guests?
3. Could this run into a livelock problem? Xen could tell the guest
that the page is ready, but by the time the guest gets around to
scheduling the userspace program, the page has been paged out again.
>>> If it's done at 4k granularity, it would also lack superpage support,
>>> which wouldn't help either. (Doing things at the 2M+ scale would help,
>>> but I don't know enough about how the MMU notifier does things.)
As an aside, graphics very much needs huge pages. On AMD, using 4K
pages means a 30% performance hit.
>>> While I agree that grants are not an adequate mechanism for this (for
>>> multiple reasons), I'm not fully convinced by the proposal.
>>> I would prefer a strategy where we map a fixed amount of RAM+VRAM as a
>>> blob, along with some form of cooperative hotplug mechanism to
>>> dynamically provision the amount.
>>
>> I asked the GPU driver developers about pinning VRAM like this a couple
>> of years ago. The response I got was that it isn't supported.
>> I suspect that anyone needing VRAM pinning for graphics workloads is
>> using non-upstreamable hacks, most likely specific to a single driver.
>>
>> More generally, the entire graphics stack receives essentially no
>> testing under Xen. There have been bugs that have affected Qubes OS
>> users for months or more, and they went unfixed because they couldn't
>> be reproduced outside of Xen. To the upstream graphics developers,
>> Xen might as well not exist. This means that any solution that
>> requires changing the graphics stack is not a practical option,
>> and I do not expect this to change in the foreseeable future.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Thread overview: 14+ messages
2026-03-24 14:17 Mapping non-pinned memory from one Xen domain into another Demi Marie Obenour
2026-03-24 18:00 ` Teddy Astie
2026-03-26 17:18 ` Demi Marie Obenour
2026-03-26 18:26 ` Teddy Astie
2026-03-27 17:18 ` Demi Marie Obenour [this message]
2026-03-29 17:32 ` Why memory lending is needed for GPU acceleration Demi Marie Obenour
2026-03-30 10:15 ` Teddy Astie
2026-03-30 10:25 ` Jan Beulich
2026-03-30 12:24 ` Demi Marie Obenour
2026-03-30 20:07 ` Val Packett
2026-03-31 9:42 ` Teddy Astie
2026-03-31 11:23 ` Val Packett
2026-04-03 21:24 ` Marek Marczykowski-Górecki
2026-03-30 12:13 ` Mapping non-pinned memory from one Xen domain into another Teddy Astie