Mapping memory into a domain

* Mapping memory into a domain
@ 2025-05-04 22:51 Demi Marie Obenour
  2025-05-04 22:56 ` Andrew Cooper
  2025-05-05 11:32 ` Alejandro Vallejo
  0 siblings, 2 replies; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-04 22:51 UTC (permalink / raw)
  To: Xen developer discussion; +Cc: Andrew Cooper, Juergen Gross


What are the appropriate Xen internal functions for:

1. Turning a PFN into an MFN?
2. Mapping an MFN into a guest?
3. Unmapping that MFN from a guest?

The first patch I am going to send with this information is a documentation
patch so that others do not need to figure this out for themselves.
I remember being unsure even after looking through the source code, which
is why I am asking here.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-04 22:51 Mapping memory into a domain Demi Marie Obenour
@ 2025-05-04 22:56 ` Andrew Cooper
  2025-05-04 23:24   ` Demi Marie Obenour
  2025-05-05 11:32 ` Alejandro Vallejo
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Cooper @ 2025-05-04 22:56 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion; +Cc: Juergen Gross

On 04/05/2025 11:51 pm, Demi Marie Obenour wrote:
> What are the appropriate Xen internal functions for:
>
> 1. Turning a PFN into an MFN?
> 2. Mapping an MFN into a guest?
> 3. Unmapping that MFN from a guest?
>
> The first patch I am going to send with this information is a documentation
> patch so that others do not need to figure this out for themselves.
> I remember being unsure even after looking through the source code, which
> is why I am asking here.

See the top of xen/include/xen/mm.h which has an overview of
terminology, including an explanation of why Xen doesn't know what the
guest thinks of as PFN.

~Andrew


* Re: Mapping memory into a domain
  2025-05-04 22:56 ` Andrew Cooper
@ 2025-05-04 23:24   ` Demi Marie Obenour
  2025-05-05 14:20     ` Roger Pau Monné
  0 siblings, 1 reply; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-04 23:24 UTC (permalink / raw)
  To: Andrew Cooper, Xen developer discussion; +Cc: Juergen Gross


On 5/4/25 6:56 PM, Andrew Cooper wrote:
> On 04/05/2025 11:51 pm, Demi Marie Obenour wrote:
>> What are the appropriate Xen internal functions for:
>>
>> 1. Turning a PFN into an MFN?
>> 2. Mapping an MFN into a guest?
>> 3. Unmapping that MFN from a guest?
>>
>> The first patch I am going to send with this information is a documentation
>> patch so that others do not need to figure this out for themselves.
>> I remember being unsure even after looking through the source code, which
>> is why I am asking here.
> 
> See the top of xen/include/xen/mm.h which has an overview of
> terminology, including an explanation of why Xen doesn't know what the
> guest thinks of as PFN.
I read that and am still confused.  Are you specifically referring to PV
guests?  For PVH and HVM guests, Xen needs to know what the guest’s PFNs
are so that it can correctly set up its own page tables.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-04 23:24   ` Demi Marie Obenour
@ 2025-05-05 14:20     ` Roger Pau Monné
  0 siblings, 0 replies; 20+ messages in thread
From: Roger Pau Monné @ 2025-05-05 14:20 UTC (permalink / raw)
  To: Demi Marie Obenour; +Cc: Andrew Cooper, Xen developer discussion, Juergen Gross

On Sun, May 04, 2025 at 07:24:46PM -0400, Demi Marie Obenour wrote:
> On 5/4/25 6:56 PM, Andrew Cooper wrote:
> > On 04/05/2025 11:51 pm, Demi Marie Obenour wrote:
> >> What are the appropriate Xen internal functions for:
> >>
> >> 1. Turning a PFN into an MFN?
> >> 2. Mapping an MFN into a guest?
> >> 3. Unmapping that MFN from a guest?
> >>
> >> The first patch I am going to send with this information is a documentation
> >> patch so that others do not need to figure this out for themselves.
> >> I remember being unsure even after looking through the source code, which
> >> is why I am asking here.
> > 
> > See the top of xen/include/xen/mm.h which has an overview of
> > terminology, including an explanation of why Xen doesn't know what the
> > guest thinks of as PFN.
> I read that and am still confused.  Are you specifically referring to PV
> guests?  For PVH and HVM guests, Xen needs to know what the guest’s PFNs
> are so that it can correctly set up its own page tables.

The term PFN on PVH and HVM is confusing, and IMO it shouldn't be used
in that context.  PFNs should only be used in PV domains context.

I'm afraid I cannot understand the question in your last sentence.
What's "its own page tables"?  Are you referring to the domain second
stage translation page-tables, iow: the p2m?  Or is it something
else?

Regards, Roger.


* Re: Mapping memory into a domain
  2025-05-04 22:51 Mapping memory into a domain Demi Marie Obenour
  2025-05-04 22:56 ` Andrew Cooper
@ 2025-05-05 11:32 ` Alejandro Vallejo
  2025-05-06  1:02   ` Demi Marie Obenour
  1 sibling, 1 reply; 20+ messages in thread
From: Alejandro Vallejo @ 2025-05-05 11:32 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion
  Cc: Andrew Cooper, Juergen Gross, Xen-devel

I suppose this is still about multiplexing the GPU driver the way we
last discussed at Xen Summit?

On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
> What are the appropriate Xen internal functions for:
>
> 1. Turning a PFN into an MFN?
> 2. Mapping an MFN into a guest?
> 3. Unmapping that MFN from a guest?

The p2m is the single source of truth about such mappings.

This is all racy business. You want to keep the p2m lock for the full
duration of whatever operation you wish to do, or you risk another CPU
taking it and pulling the rug under your feet at the most inconvenient
time.

In general all this faff is hidden under way too many layers beneath
copy_{to,from}_guest(). Other p2m manipulation high-level constructs
that might do interesting things worth looking at may be {map,unmap}_mmio_region()

Note that not every pfn has an associated mfn. Not even every valid pfn
necessarily has an associated mfn (there's PoD, populate-on-demand). And
all of this is volatile business in the presence of a balloon driver or
vPCI placing mmio windows over guest memory.

In general anything up this alley would need a cohesive pair for
map/unmap and a credible plan for concurrency and how it's all handled
in conjunction with other bits that touch the p2m.
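
To make the shape of such a map/unmap pair concrete, here is a minimal toy
sketch in C. It is not Xen code: struct toy_p2m, toy_map(), toy_unmap() and
toy_lookup() are hypothetical names invented for illustration, the "p2m" is
a flat array, and the lock is a trivial spinlock. What it shows is only the
pattern described above: the check and the update happen under a single lock
hold, and a lookup can legitimately come back empty.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_GFNS 16u
#define TOY_INVALID UINT64_MAX /* "nothing backs this gfn right now" */

/* Hypothetical toy p2m: one lock and one flat table. Not a Xen structure. */
struct toy_p2m {
    atomic_flag lock;
    uint64_t mfn[TOY_NR_GFNS];
};

static void toy_lock(struct toy_p2m *p2m)
{
    while (atomic_flag_test_and_set_explicit(&p2m->lock, memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

static void toy_unlock(struct toy_p2m *p2m)
{
    atomic_flag_clear_explicit(&p2m->lock, memory_order_release);
}

static void toy_p2m_init(struct toy_p2m *p2m)
{
    atomic_flag_clear(&p2m->lock);
    for (unsigned int i = 0; i < TOY_NR_GFNS; i++)
        p2m->mfn[i] = TOY_INVALID;
}

/* Look up a gfn. The answer may be TOY_INVALID, and it is only
 * guaranteed to stay true while the caller prevents concurrent changes. */
static uint64_t toy_lookup(struct toy_p2m *p2m, unsigned int gfn)
{
    uint64_t mfn = TOY_INVALID;

    if (gfn < TOY_NR_GFNS) {
        toy_lock(p2m);
        mfn = p2m->mfn[gfn];
        toy_unlock(p2m);
    }
    return mfn;
}

/* Map mfn at gfn, but only if the slot is empty. The emptiness check and
 * the update share one lock hold, so nobody can slip in between them. */
static bool toy_map(struct toy_p2m *p2m, unsigned int gfn, uint64_t mfn)
{
    bool ok = false;

    if (gfn >= TOY_NR_GFNS || mfn == TOY_INVALID)
        return false;

    toy_lock(p2m);
    if (p2m->mfn[gfn] == TOY_INVALID) {
        p2m->mfn[gfn] = mfn;
        ok = true;
    }
    toy_unlock(p2m);
    return ok;
}

/* Unmap gfn, but only if it still points at the mfn the caller expects. */
static bool toy_unmap(struct toy_p2m *p2m, unsigned int gfn, uint64_t mfn)
{
    bool ok = false;

    if (gfn >= TOY_NR_GFNS || mfn == TOY_INVALID)
        return false;

    toy_lock(p2m);
    if (p2m->mfn[gfn] == mfn) {
        p2m->mfn[gfn] = TOY_INVALID;
        ok = true;
    }
    toy_unlock(p2m);
    return ok;
}
```

The real p2m has types, superpages, PoD and per-arch differences that this
sketch ignores entirely; it is only meant to show why map and unmap want to
be designed together.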

>
> The first patch I am going to send with this information is a documentation
> patch so that others do not need to figure this out for themselves.
> I remember being unsure even after looking through the source code, which
> is why I am asking here.

That's not surprising. There's per-arch stuff, per-p2mtype stuff,
per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
such helpers don't exist and the general manipulations are hard to
explain.

Cheers,
Alejandro

* Re: Mapping memory into a domain
  2025-05-05 11:32 ` Alejandro Vallejo
@ 2025-05-06  1:02   ` Demi Marie Obenour
  2025-05-06 13:06     ` Alejandro Vallejo
  0 siblings, 1 reply; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-06  1:02 UTC (permalink / raw)
  To: Alejandro Vallejo, Xen developer discussion
  Cc: Andrew Cooper, Juergen Gross, Xen-devel


On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
> I suppose this is still about multiplexing the GPU driver the way we
> last discussed at Xen Summit?
> 
> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
>> What are the appropriate Xen internal functions for:
>>
>> 1. Turning a PFN into an MFN?
>> 2. Mapping an MFN into a guest?
>> 3. Unmapping that MFN from a guest?
> 
> The p2m is the single source of truth about such mappings.
> 
> This is all racy business. You want to keep the p2m lock for the full
> duration of whatever operation you wish to do, or you risk another CPU
> taking it and pulling the rug under your feet at the most inconvenient
> time.
> 
> In general all this faff is hidden under way too many layers beneath
> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
> 
> Note that not every pfn has an associated mfn. Not even every valid pfn
> has necessarily an associated mfn (there's pod). And all of this is
> volatile business in the presence of a balloon driver or vPCI placing
> mmio windows over guest memory.

Can I check that POD is not in use?  

> In general anything up this alley would need a cohesive pair for
> map/unmap and a credible plan for concurrency and how it's all handled
> in conjunction with other bits that touch the p2m.

Is taking the p2m lock for the entire operation a reasonable approach
for concurrency?  Will this cause too much lock contention?

>> The first patch I am going to send with this information is a documentation
>> patch so that others do not need to figure this out for themselves.
>> I remember being unsure even after looking through the source code, which
>> is why I am asking here.
> 
> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
> such helpers don't exist and the general manipulations are hard to
> explain.

Is this a task that is only suitable for someone who has several years
experience working on Xen, or is it something that would make sense for
someone who is less experienced?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-06  1:02   ` Demi Marie Obenour
@ 2025-05-06 13:06     ` Alejandro Vallejo
  2025-05-06 20:56       ` Demi Marie Obenour
  0 siblings, 1 reply; 20+ messages in thread
From: Alejandro Vallejo @ 2025-05-06 13:06 UTC (permalink / raw)
  To: Demi Marie Obenour, Xen developer discussion
  Cc: Andrew Cooper, Juergen Gross, Xen-devel

On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
>> I suppose this is still about multiplexing the GPU driver the way we
>> last discussed at Xen Summit?
>> 
>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
>>> What are the appropriate Xen internal functions for:
>>>
>>> 1. Turning a PFN into an MFN?
>>> 2. Mapping an MFN into a guest?
>>> 3. Unmapping that MFN from a guest?
>> 
>> The p2m is the single source of truth about such mappings.
>> 
>> This is all racy business. You want to keep the p2m lock for the full
>> duration of whatever operation you wish to do, or you risk another CPU
>> taking it and pulling the rug under your feet at the most inconvenient
>> time.
>> 
>> In general all this faff is hidden under way too many layers beneath
>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
>> 
>> Note that not every pfn has an associated mfn. Not even every valid pfn
>> has necessarily an associated mfn (there's pod). And all of this is
>> volatile business in the presence of a balloon driver or vPCI placing
>> mmio windows over guest memory.
>
> Can I check that POD is not in use?  

Maybe, but now you're reaching exponential complexity once you take each
individual knob of the p2m into account.

>
>> In general anything up this alley would need a cohesive pair for
>> map/unmap and a credible plan for concurrency and how it's all handled
>> in conjunction with other bits that touch the p2m.
>
> Is taking the p2m lock for the entire operation a reasonable approach
> for concurrency?  Will this cause too much lock contention?

Maybe. It'd be fine for a page. Likely not so for several GiB if they
aren't already superpages.
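
For a rough sense of scale, the arithmetic can be sketched as below. The
helper toy_entries_needed() is a hypothetical name, and the page sizes are
the usual x86-64 ones (4 KiB, 2 MiB, 1 GiB). It only counts how many
mappings a range needs at a given granularity; it says nothing about what
each individual update costs.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_4K (UINT64_C(1) << 12)
#define TOY_PAGE_2M (UINT64_C(1) << 21)
#define TOY_PAGE_1G (UINT64_C(1) << 30)

/* Number of mappings needed to cover `bytes` at the given page size,
 * rounding up for a partial trailing page. */
static uint64_t toy_entries_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

Covering 4 GiB takes 1048576 entries at 4 KiB, 2048 at 2 MiB and 4 at
1 GiB, which is why holding a lock across the whole range looks very
different depending on whether superpages are available.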

>
>>> The first patch I am going to send with this information is a documentation
>>> patch so that others do not need to figure this out for themselves.
>>> I remember being unsure even after looking through the source code, which
>>> is why I am asking here.
>> 
>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
>> such helpers don't exist and the general manipulations are hard to
>> explain.
>
> Is this a task that is only suitable for someone who has several years
> experience working on Xen, or is it something that would make sense for
> someone who is less experienced?

The p2m is a very complex beast that integrates more features than I
care to count. It requires a lot of prior knowledge. Whoever does it
must know Xen fairly well in many configurations.

The real problem is finding the right primitives that do what you want
without overcomplicating everything else, that preserve system security
invariants, and that have benign (and ideally clear) edge cases.

This was the last email you sent (I think?). Has any of the requirements
changed in any direction?

  https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/

Something I'm missing there is how everything works without Xen. That
might help (me, at least) gauge what could prove enough to support the
usecase. Are there sequence diagrams anywhere about how this whole thing
works without Xen? I vaguely remember you showing something last year in
Xen Summit in the design session, but my memory isn't that good :)

Cheers,
Alejandro


* Re: Mapping memory into a domain
  2025-05-06 13:06     ` Alejandro Vallejo
@ 2025-05-06 20:56       ` Demi Marie Obenour
  2025-05-07 17:39         ` Roger Pau Monné
  0 siblings, 1 reply; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-06 20:56 UTC (permalink / raw)
  To: Alejandro Vallejo, Xen developer discussion
  Cc: Andrew Cooper, Juergen Gross, Xen-devel


On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
> On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
>> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
>>> I suppose this is still about multiplexing the GPU driver the way we
>>> last discussed at Xen Summit?
>>>
>>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
>>>> What are the appropriate Xen internal functions for:
>>>>
>>>> 1. Turning a PFN into an MFN?
>>>> 2. Mapping an MFN into a guest?
>>>> 3. Unmapping that MFN from a guest?
>>>
>>> The p2m is the single source of truth about such mappings.
>>>
>>> This is all racy business. You want to keep the p2m lock for the full
>>> duration of whatever operation you wish to do, or you risk another CPU
>>> taking it and pulling the rug under your feet at the most inconvenient
>>> time.
>>>
>>> In general all this faff is hidden under way too many layers beneath
>>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
>>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
>>>
>>> Note that not every pfn has an associated mfn. Not even every valid pfn
>>> has necessarily an associated mfn (there's pod). And all of this is
>>> volatile business in the presence of a balloon driver or vPCI placing
>>> mmio windows over guest memory.
>>
>> Can I check that POD is not in use?  
> 
> Maybe, but now you're reaching exponential complexity taking each
> individual knob of the p2m into account.
> 
>>
>>> In general anything up this alley would need a cohesive pair for
>>> map/unmap and a credible plan for concurrency and how it's all handled
>>> in conjunction with other bits that touch the p2m.
>>
>> Is taking the p2m lock for the entire operation a reasonable approach
>> for concurrency?  Will this cause too much lock contention?
> 
> Maybe. It'd be fine for a page. Likely not so for several GiB if they
> aren't already superpages.
> 
>>
>>>> The first patch I am going to send with this information is a documentation
>>>> patch so that others do not need to figure this out for themselves.
>>>> I remember being unsure even after looking through the source code, which
>>>> is why I am asking here.
>>>
>>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
>>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
>>> such helpers don't exist and the general manipulations are hard to
>>> explain.
>>
>> Is this a task that is only suitable for someone who has several years
>> experience working on Xen, or is it something that would make sense for
>> someone who is less experienced?
> 
> The p2m is a very complex beast that integrates more features than I
> care to count. It requires a lot of prior knowledge. Whoever does it
> must know Xen fairly well in many configurations.
> 
> The real problem is finding the right primitives that do what you want
> without overcomplicating everything else, preserving system security
> invariants and have benign (and ideally clear) edge cases.
> 
> This was the last email you sent (I think?). Has any of the requirements
> changed in any direction?
> 
>   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/

Map and Revoke are still needed, with the same requirements as described
in this email.  Steal and Return were needed for GPU shared virtual memory,
but it has been decided to not support this with virtio-GPU, so these
primitives are no longer needed.

> Something I'm missing there is how everything works without Xen. That
> might help (me, at least) gauge what could prove enough to support the
> usecase. Are there sequence diagrams anywhere about how this whole thing
> works without Xen? I vaguely remember you showing something last year in
> Xen Summit in the design session, but my memory isn't that good :)

A Linux driver that needs access to userspace memory
pages can get it in two different ways:

1. It can pin the pages using the pin_user_pages family of APIs.
   If these functions succeed, the driver is guaranteed to be able
   to access the pages until it unpins them.  However, this also
   means that the pages cannot be paged out or migrated.  Furthermore,
   file-backed pages cannot be safely pinned, and pinning GPU memory
   isn’t supported.  (At a minimum, it would prevent the pages from
   migrating from system RAM to VRAM, so all access by a dGPU would
   cross the PCIe bus, which would be very slow.)

2. It can grab the *current* location of the pages and register an
   MMU notifier.  This works for GPU memory and file-backed memory.
   However, when the invalidate_range function of this notifier is called, the
   driver *must* stop all further accesses to the pages.

   The invalidate_range callback is not allowed to block for a long
   period of time.  My understanding is that things like dirty page
   writeback are blocked while the callback is in progress.  My
   understanding is also that the callback is not allowed to fail.
   I believe it can return a retryable error but I don’t think that
   it is allowed to keep failing forever.

   Linux’s grant table driver actually had a bug in this area, which
   led to deadlocks.  I fixed that a while back.

KVM implements the second option: it maps pages into the stage-2
page tables (or shadow page tables, if that is chosen) and unmaps
them when the invalidate_range callback is called.  Furthermore,
if a page fault happens while the page is unmapped, KVM will try
to bring the pages back into memory so the guest can access them.

For GPU acceleration via virtio-GPU native contexts to work,
the Xen interface driver needs to do the same thing with GPU
buffers that KVM does: it needs to fault the pages into guest
memory on-demand and revoke access to the pages when the host
kernel demands them back.  There really is no alternative that
I am aware of.  The need to handle guest page faults doesn’t
come from the host kernel, but rather from guest userspace.
It isn’t practical to change guest userspace to remove this
requirement.
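
The Map and Revoke behaviour described above can be modelled in a few lines
of C. This is a toy sketch only: struct toy_buffer, toy_guest_access() and
toy_invalidate() are hypothetical names, there is no real MMU notifier or
stage-2 page table involved, and "locations" are plain integers. It shows
the two halves of the contract: the guest mapping is filled in lazily on
access, and the host can take a page back at any time without the revoke
being allowed to fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_PAGES 4u

/* Hypothetical model of one GPU buffer shared with a guest. */
struct toy_buffer {
    uint64_t host_location[TOY_NR_PAGES]; /* where the host keeps each page now */
    uint64_t guest_sees[TOY_NR_PAGES];    /* location installed in the guest mapping */
    bool guest_mapped[TOY_NR_PAGES];
    unsigned int faults;                  /* how many times a page was faulted in */
};

static void toy_buffer_init(struct toy_buffer *b)
{
    for (unsigned int i = 0; i < TOY_NR_PAGES; i++) {
        b->host_location[i] = 0x1000 + i;
        b->guest_sees[i] = 0;
        b->guest_mapped[i] = false;
    }
    b->faults = 0;
}

/* "Revoke": the host is about to move a page, so the guest mapping is
 * dropped before the move. This path has no failure return on purpose. */
static void toy_invalidate(struct toy_buffer *b, unsigned int page,
                           uint64_t new_location)
{
    assert(page < TOY_NR_PAGES);
    b->guest_mapped[page] = false;
    b->host_location[page] = new_location;
}

/* "Map": a guest access to an unmapped page faults, and the fault handler
 * installs the page's current location before the access completes. */
static uint64_t toy_guest_access(struct toy_buffer *b, unsigned int page)
{
    assert(page < TOY_NR_PAGES);
    if (!b->guest_mapped[page]) {
        b->faults++;
        b->guest_sees[page] = b->host_location[page];
        b->guest_mapped[page] = true;
    }
    return b->guest_sees[page];
}
```

After a revoke the next guest access faults again and picks up the new
location, so the guest never observes the stale one.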
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-06 20:56       ` Demi Marie Obenour
@ 2025-05-07 17:39         ` Roger Pau Monné
  2025-05-08  0:36           ` Demi Marie Obenour
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Pau Monné @ 2025-05-07 17:39 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Alejandro Vallejo, Xen developer discussion, Andrew Cooper,
	Juergen Gross, Xen-devel

On Tue, May 06, 2025 at 04:56:12PM -0400, Demi Marie Obenour wrote:
> On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
> > On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
> >> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
> >>> I suppose this is still about multiplexing the GPU driver the way we
> >>> last discussed at Xen Summit?
> >>>
> >>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
> >>>> What are the appropriate Xen internal functions for:
> >>>>
> >>>> 1. Turning a PFN into an MFN?
> >>>> 2. Mapping an MFN into a guest?
> >>>> 3. Unmapping that MFN from a guest?
> >>>
> >>> The p2m is the single source of truth about such mappings.
> >>>
> >>> This is all racy business. You want to keep the p2m lock for the full
> >>> duration of whatever operation you wish to do, or you risk another CPU
> >>> taking it and pulling the rug under your feet at the most inconvenient
> >>> time.
> >>>
> >>> In general all this faff is hidden under way too many layers beneath
> >>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
> >>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
> >>>
> >>> Note that not every pfn has an associated mfn. Not even every valid pfn
> >>> has necessarily an associated mfn (there's pod). And all of this is
> >>> volatile business in the presence of a balloon driver or vPCI placing
> >>> mmio windows over guest memory.
> >>
> >> Can I check that POD is not in use?  
> > 
> > Maybe, but now you're reaching exponential complexity taking each
> > individual knob of the p2m into account.
> > 
> >>
> >>> In general anything up this alley would need a cohesive pair for
> >>> map/unmap and a credible plan for concurrency and how it's all handled
> >>> in conjunction with other bits that touch the p2m.
> >>
> >> Is taking the p2m lock for the entire operation a reasonable approach
> >> for concurrency?  Will this cause too much lock contention?
> > 
> > Maybe. It'd be fine for a page. Likely not so for several GiB if they
> > aren't already superpages.
> > 
> >>
> >>>> The first patch I am going to send with this information is a documentation
> >>>> patch so that others do not need to figure this out for themselves.
> >>>> I remember being unsure even after looking through the source code, which
> >>>> is why I am asking here.
> >>>
> >>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
> >>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
> >>> such helpers don't exist and the general manipulations are hard to
> >>> explain.
> >>
> >> Is this a task that is only suitable for someone who has several years
> >> experience working on Xen, or is it something that would make sense for
> >> someone who is less experienced?
> > 
> > The p2m is a very complex beast that integrates more features than I
> > care to count. It requires a lot of prior knowledge. Whoever does it
> > must know Xen fairly well in many configurations.
> > 
> > The real problem is finding the right primitives that do what you want
> > without overcomplicating everything else, preserving system security
> > invariants and have benign (and ideally clear) edge cases.
> > 
> > This was the last email you sent (I think?). Has any of the requirements
> > changed in any direction?
> > 
> >   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/
> 
> Map and Revoke are still needed, with the same requirements as described
> in this email.  Steal and Return were needed for GPU shared virtual memory,
> but it has been decided to not support this with virtio-GPU, so these
> primitives are no longer needed.
> 
> > Something I'm missing there is how everything works without Xen. That
> > might help (me, at least) gauge what could prove enough to support the
> > usecase. Are there sequence diagrams anywhere about how this whole thing
> > works without Xen? I vaguely remember you showing something last year in
> > Xen Summit in the design session, but my memory isn't that good :)

Hello,

Sorry, possibly replying a bit out of context here.

Since I will mention this in several places: p2m is the second stage
page-tables used by Xen for PVH and HVM guests.  A p2m violation is
the equivalent of a page-fault for guest p2m accesses.

> A Linux driver that needs access to userspace memory
> pages can get it in two different ways:
> 
> 1. It can pin the pages using the pin_user_pages family of APIs.
>    If these functions succeed, the driver is guaranteed to be able
>    to access the pages until it unpins them.  However, this also
>    means that the pages cannot be paged out or migrated.  Furthermore,
>    file-backed pages cannot be safely pinned, and pinning GPU memory
>    isn’t supported.  (At a minimum, it would prevent the pages from
>    migrating from system RAM to VRAM, so all access by a dGPU would
>    cross the PCIe bus, which would be very slow.)

From a Xen p2m perspective this is all fine - Xen will never remove pages
from the p2m unless it's requested to.  So the pinning, while needed on the
Linux side, doesn't need to be propagated to Xen I would think.

> 
> 2. It can grab the *current* location of the pages and register an
>    MMU notifier.  This works for GPU memory and file-backed memory.
>    However, when the invalidate_range function of this notifier is called, the
>    driver *must* stop all further accesses to the pages.
> 
>    The invalidate_range callback is not allowed to block for a long
>    period of time.  My understanding is that things like dirty page
>    writeback are blocked while the callback is in progress.  My
>    understanding is also that the callback is not allowed to fail.
>    I believe it can return a retryable error but I don’t think that
>    it is allowed to keep failing forever.
> 
>    Linux’s grant table driver actually had a bug in this area, which
>    led to deadlocks.  I fixed that a while back.
> 
> KVM implements the second option: it maps pages into the stage-2
> page tables (or shadow page tables, if that is chosen) and unmaps
> them when the invalidate_range callback is called.

I assume this map and unmap is done by the host as a result of some
guest action?

> Furthermore,
> if a page fault happens while the page is unmapped, KVM will try
> to bring the pages back into memory so the guest can access them.

You could likely handle this in Xen in the following way:

 - A device model will get p2m violations forwarded, as it's the same
   model that's used to handle emulation of device MMIO.  You will
   need to register an ioreq server to request those faults to be
   forwarded, I think the hardware domain kernel will handle those?

 - Allow ioreqs to signal to Xen that a guest operation must be
   retried.  IOW: resume guest execution without advancing the IP.

I think this last bit is the one that will require changes to Xen, so
that you can add a type of ioreq reply that implies a retry from the
guest context.
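
As a sketch of the control flow proposed above, assuming a hypothetical
"retry" reply that does not exist in Xen today: toy_step(),
toy_handle_violation() and the TOY_REPLY_* values below are invented for
illustration and are not the ioreq interface. The model is a vCPU with an
instruction pointer, a per-gfn "mapped" flag standing in for the p2m, and a
handler standing in for the device model.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_GFNS 8u

/* Hypothetical reply codes; the RETRY one is the proposed addition. */
enum toy_reply { TOY_REPLY_DONE, TOY_REPLY_RETRY };

struct toy_vcpu {
    unsigned int ip;         /* index of the next instruction to execute */
    unsigned int violations; /* p2m violations forwarded so far */
};

struct toy_guest {
    bool mapped[TOY_NR_GFNS]; /* stand-in for "gfn present in the p2m" */
};

/* Stand-in for the device model: populate the page, then ask for the
 * faulting instruction to be executed again. */
static enum toy_reply toy_handle_violation(struct toy_guest *g, unsigned int gfn)
{
    g->mapped[gfn] = true;
    return TOY_REPLY_RETRY;
}

/* Execute one instruction that touches gfn. Returns true if it retired.
 * On a violation with a RETRY reply the IP is left untouched, so the
 * same instruction runs again on the next call. */
static bool toy_step(struct toy_vcpu *v, struct toy_guest *g, unsigned int gfn)
{
    assert(gfn < TOY_NR_GFNS);
    if (!g->mapped[gfn]) {
        v->violations++;
        if (toy_handle_violation(g, gfn) == TOY_REPLY_RETRY)
            return false; /* resume the guest at the same IP */
        /* TOY_REPLY_DONE: the handler emulated the access; fall through. */
    }
    v->ip++;
    return true;
}
```

The first call on an unmapped gfn returns without advancing the IP, and the
second call retires the instruction with no further violation.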

> For GPU acceleration via virtio-GPU native contexts to work,
> the Xen interface driver needs to do the same thing with GPU
> buffers that KVM does: it needs to fault the pages into guest
> memory on-demand and revoke access to the pages when the host
> kernel demands them back.  There really is no alternative that
> I am aware of.  The need to handle guest page faults doesn’t
> come from the host kernel, but rather from guest userspace.

I'm a bit confused by this last sentence: the "page faults"
mentioned here are p2m violations I think?

Hope this makes some sense.

Regards, Roger.


* Re: Mapping memory into a domain
  2025-05-07 17:39         ` Roger Pau Monné
@ 2025-05-08  0:36           ` Demi Marie Obenour
  2025-05-08  7:52             ` Roger Pau Monné
  0 siblings, 1 reply; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-08  0:36 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Alejandro Vallejo, Xen developer discussion, Andrew Cooper,
	Juergen Gross, Xen-devel


On 5/7/25 1:39 PM, Roger Pau Monné wrote:
> On Tue, May 06, 2025 at 04:56:12PM -0400, Demi Marie Obenour wrote:
>> On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
>>> On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
>>>> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
>>>>> I suppose this is still about multiplexing the GPU driver the way we
>>>>> last discussed at Xen Summit?
>>>>>
>>>>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
>>>>>> What are the appropriate Xen internal functions for:
>>>>>>
>>>>>> 1. Turning a PFN into an MFN?
>>>>>> 2. Mapping an MFN into a guest?
>>>>>> 3. Unmapping that MFN from a guest?
>>>>>
>>>>> The p2m is the single source of truth about such mappings.
>>>>>
>>>>> This is all racy business. You want to keep the p2m lock for the full
>>>>> duration of whatever operation you wish to do, or you risk another CPU
>>>>> taking it and pulling the rug under your feet at the most inconvenient
>>>>> time.
>>>>>
>>>>> In general all this faff is hidden under way too many layers beneath
>>>>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
>>>>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
>>>>>
>>>>> Note that not every pfn has an associated mfn. Not even every valid pfn
>>>>> has necessarily an associated mfn (there's pod). And all of this is
>>>>> volatile business in the presence of a balloon driver or vPCI placing
>>>>> mmio windows over guest memory.
>>>>
>>>> Can I check that POD is not in use?  
>>>
>>> Maybe, but now you're reaching exponential complexity taking each
>>> individual knob of the p2m into account.
>>>
>>>>
>>>>> In general anything up this alley would need a cohesive pair for
>>>>> map/unmap and a credible plan for concurrency and how it's all handled
>>>>> in conjunction with other bits that touch the p2m.
>>>>
>>>> Is taking the p2m lock for the entire operation a reasonable approach
>>>> for concurrency?  Will this cause too much lock contention?
>>>
>>> Maybe. It'd be fine for a page. Likely not so for several GiB if they
>>> aren't already superpages.
>>>
>>>>
>>>>>> The first patch I am going to send with this information is a documentation
>>>>>> patch so that others do not need to figure this out for themselves.
>>>>>> I remember being unsure even after looking through the source code, which
>>>>>> is why I am asking here.
>>>>>
>>>>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
>>>>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
>>>>> such helpers don't exist and the general manipulations are hard to
>>>>> explain.
>>>>
>>>> Is this a task that is only suitable for someone who has several years
>>>> experience working on Xen, or is it something that would make sense for
>>>> someone who is less experienced?
>>>
>>> The p2m is a very complex beast that integrates more features than I
>>> care to count. It requires a lot of prior knowledge. Whoever does it
>>> must know Xen fairly well in many configurations.
>>>
>>> The real problem is finding the right primitives that do what you want
>>> without overcomplicating everything else, preserving system security
>>> invariants and have benign (and ideally clear) edge cases.
>>>
>>> This was the last email you sent (I think?). Has any of the requirements
>>> changed in any direction?
>>>
>>>   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/
>>
>> Map and Revoke are still needed, with the same requirements as described
>> in this email.  Steal and Return were needed for GPU shared virtual memory,
>> but it has been decided to not support this with virtio-GPU, so these
>> primitives are no longer needed.
>>
>>> Something I'm missing there is how everything works without Xen. That
>>> might help (me, at least) gauge what could prove enough to support the
>>> usecase. Are there sequence diagrams anywhere about how this whole thing
>>> works without Xen? I vaguely remember you showing something last year in
>>> Xen Summit in the design session, but my memory isn't that good :)
> 
> Hello,
> 
> Sorry, possibly replying a bit out of context here.
> 
> Since I will mention this in several places: p2m is the second stage
> page-tables used by Xen for PVH and HVM guests.  A p2m violation is
> the equivalent of a page-fault for guest p2m accesses.
> 
>> A Linux driver that needs access to userspace memory
>> pages can get it in two different ways:
>>
>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>    If these functions succeed, the driver is guaranteed to be able
>>    to access the pages until it unpins them.  However, this also
>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>    cross the PCIe bus, which would be very slow.)
> 
> From a Xen p2m this is all fine - Xen will never remove pages from the
> p2m unless it's requested to.  So the pinning, while needed on the Linux
> side, doesn't need to be propagated to Xen I would think.

If pinning were enough things would be simple, but sadly it’s not.

>> 2. It can grab the *current* location of the pages and register an
>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>    However, when the invalidate_range function of this callback is called, the
>>    driver *must* stop all further accesses to the pages.
>>
>>    The invalidate_range callback is not allowed to block for a long
>>    period of time.  My understanding is that things like dirty page
>>    writeback are blocked while the callback is in progress.  My
>>    understanding is also that the callback is not allowed to fail.
>>    I believe it can return a retryable error but I don’t think that
>>    it is allowed to keep failing forever.
>>
>>    Linux’s grant table driver actually had a bug in this area, which
>>    led to deadlocks.  I fixed that a while back.
>>
>> KVM implements the second option: it maps pages into the stage-2
>> page tables (or shadow page tables, if that is chosen) and unmaps
>> them when the invalidate_range callback is called.
> 
> I assume this map and unmap is done by the host as a result of some
> guest action?

Unmapping can happen at any time for any or no reason.  Semantically,
it would be correct to only map the pages in response to a p2m violation,
but for performance it might be better to map the pages eagerly instead.

>> Furthermore,
>> if a page fault happens while the page is unmapped, KVM will try
>> to bring the pages back into memory so the guest can access it.
> 
> You could likely handle this in Xen in the following way:
> 
>  - A device model will get p2m violations forwarded, as it's the same
>    model that's used to handle emulation of device MMIO.  You will
>    need to register an ioreq server to request those faults to be
>    forwarded, I think the hardware domain kernel will handle those?
> 
>  - Allow ioreqs to signal to Xen that a guest operation must be
>    retried.  IOW: resume guest execution without advancing the IP.
> 
> I think this last bit is the one that will require changes to Xen, so
> that you can add a type of ioreq reply that implies a retry from the
> guest context.

I’m not actually sure if this is needed, though it would be nice.  It
might be possible for Xen to instead emulate the current instruction and
continue, with the ioreq server just returning the current value of the
pages.  What I’m more concerned about is being able to provide a page
into the p2m so that the *next* access doesn’t fault, and being able
to remove that page from the p2m so that the next access *does* fault.

Are there any hypercalls that can be used for these operations right
now?  If not, which Xen functions would one use to implement them?
Some notes:

- The p2m might need to be made to point to a PCI BAR or system RAM.
  The guest kernel and host userspace don’t know which, and in any
  case don’t need to care.  The host kernel knows, but I don’t know
  if the information is exposed to the Xen driver.

- If the p2m needs to point to system RAM, the RAM will be memory
  that belongs to the backend.

- If the p2m needs to point to a PCI BAR, it will initially need
  to point to a real PCI device that is owned by the backend.

- The switch from “emulated MMIO” to “MMIO or real RAM” needs to
  be atomic from the guest’s perspective.
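
For illustration, the two primitives asked about here can be modelled
roughly as follows.  This is only a toy sketch in Python, not Xen code:
ToyP2M, map, revoke and access are hypothetical names, and a single
dictionary store stands in for one p2m entry update.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class EntryType(Enum):
    EMULATED = auto()  # no mapping: the access traps and is forwarded for emulation
    DIRECT = auto()    # mapping present: the access goes straight to the frame


@dataclass
class ToyP2M:
    """Hypothetical stand-in for a guest's second-stage mappings (gfn -> mfn)."""
    entries: dict = field(default_factory=dict)

    def map(self, gfn, mfn):
        # One store models one entry update; the next access does not fault.
        self.entries[gfn] = mfn

    def revoke(self, gfn):
        # Dropping the entry makes the next access fault again.
        # Revoking an entry that is not present is a no-op in this model.
        self.entries.pop(gfn, None)

    def access(self, gfn):
        mfn = self.entries.get(gfn)
        if mfn is None:
            return (EntryType.EMULATED, None)
        return (EntryType.DIRECT, mfn)
```

The model deliberately does not care whether the frame is RAM or a
PCI BAR, which matches the notes above: only the presence or absence
of the entry decides whether the guest access faults.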

>> For GPU acceleration via virtio-GPU native contexts to work,
>> the Xen interface driver needs to do the same thing with GPU
>> buffers that KVM does: it needs to fault the pages into guest
>> memory on-demand and revoke access to the pages when the host
>> kernel demands them back.  There really is no alternative that
>> I am aware of.  The need to handle guest page faults doesn’t
>> come from the host kernel, but rather from guest userspace.
> 
> I'm a bit confused with this last sentence, the "page faults"
> mentioned here are p2m violations I think?

Yes, they are.

> Hope this makes some sense.

It does.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-08  0:36           ` Demi Marie Obenour
@ 2025-05-08  7:52             ` Roger Pau Monné
  2025-05-09  4:52               ` Demi Marie Obenour
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Pau Monné @ 2025-05-08  7:52 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Alejandro Vallejo, Xen developer discussion, Andrew Cooper,
	Juergen Gross, Xen-devel

On Wed, May 07, 2025 at 08:36:07PM -0400, Demi Marie Obenour wrote:
> On 5/7/25 1:39 PM, Roger Pau Monné wrote:
> > On Tue, May 06, 2025 at 04:56:12PM -0400, Demi Marie Obenour wrote:
> >> On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
> >>> On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
> >>>> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
> >>>>> I suppose this is still about multiplexing the GPU driver the way we
> >>>>> last discussed at Xen Summit?
> >>>>>
> >>>>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
> >>>>>> What are the appropriate Xen internal functions for:
> >>>>>>
> >>>>>> 1. Turning a PFN into an MFN?
> >>>>>> 2. Mapping an MFN into a guest?
> >>>>>> 3. Unmapping that MFN from a guest?
> >>>>>
> >>>>> The p2m is the single source of truth about such mappings.
> >>>>>
> >>>>> This is all racy business. You want to keep the p2m lock for the full
> >>>>> duration of whatever operation you wish to do, or you risk another CPU
> >>>>> taking it and pulling the rug under your feet at the most inconvenient
> >>>>> time.
> >>>>>
> >>>>> In general all this faff is hidden under way too many layers beneath
> >>>>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
> >>>>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
> >>>>>
> >>>>> Note that not every pfn has an associated mfn. Not even every valid pfn
> >>>>> has necessarily an associated mfn (there's pod). And all of this is
> >>>>> volatile business in the presence of a balloon driver or vPCI placing
> >>>>> mmio windows over guest memory.
> >>>>
> >>>> Can I check that POD is not in use?  
> >>>
> >>> Maybe, but now you're reaching exponential complexity considering each
> >>> individual knob of the p2m into account.
> >>>
> >>>>
> >>>>> In general anything up this alley would need a cohesive pair for
> >>>>> map/unmap and a credible plan for concurrency and how it's all handled
> >>>>> in conjunction with other bits that touch the p2m.
> >>>>
> >>>> Is taking the p2m lock for the entire operation a reasonable approach
> >>>> for concurrency?  Will this cause too much lock contention?
> >>>
> >>> Maybe. It'd be fine for a page. Likely not so for several GiB if they
> >>> aren't already superpages.
> >>>
> >>>>
> >>>>>> The first patch I am going to send with this information is a documentation
> >>>>>> patch so that others do not need to figure this out for themselves.
> >>>>>> I remember being unsure even after looking through the source code, which
> >>>>>> is why I am asking here.
> >>>>>
> >>>>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
> >>>>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
> >>>>> such helpers don't exist and the general manipulations are hard to
> >>>>> explain.
> >>>>
> >>>> Is this a task that is only suitable for someone who has several years
> >>>> experience working on Xen, or is it something that would make sense for
> >>>> someone who is less experienced?
> >>>
> >>> The p2m is a very complex beast that integrates more features than I
> >>> care to count. It requires a lot of prior knowledge. Whoever does it
> >>> must know Xen fairly well in many configurations.
> >>>
> >>> The real problem is finding the right primitives that do what you want
> >>> without overcomplicating everything else, preserving system security
> >>> invariants and have benign (and ideally clear) edge cases.
> >>>
> >>> This was the last email you sent (I think?). Has any of the requirements
> >>> changed in any direction?
> >>>
> >>>   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/
> >>
> >> Map and Revoke are still needed, with the same requirements as described
> >> in this email.  Steal and Return were needed for GPU shared virtual memory,
> >> but it has been decided to not support this with virtio-GPU, so these
> >> primitives are no longer needed.
> >>
> >>> Something I'm missing there is how everything works without Xen. That
> >>> might help (me, at least) gauge what could prove enough to support the
> >>> usecase. Are there sequence diagrams anywhere about how this whole thing
> >>> works without Xen? I vaguely remember you showing something last year in
> >>> Xen Summit in the design session, but my memory isn't that good :)
> > 
> > Hello,
> > 
> > Sorry, possibly replying a bit out of context here.
> > 
> > Since I will mention this in several places: p2m is the second stage
> > page-tables used by Xen for PVH and HVM guests.  A p2m violation is
> > the equivalent of a page-fault for guest p2m accesses.
> > 
> >> A Linux driver that needs access to userspace memory
> >> pages can get it in two different ways:
> >>
> >> 1. It can pin the pages using the pin_user_pages family of APIs.
> >>    If these functions succeed, the driver is guaranteed to be able
> >>    to access the pages until it unpins them.  However, this also
> >>    means that the pages cannot be paged out or migrated.  Furthermore,
> >>    file-backed pages cannot be safely pinned, and pinning GPU memory
> >>    isn’t supported.  (At a minimum, it would prevent the pages from
> >>    migrating from system RAM to VRAM, so all access by a dGPU would
> >>    cross the PCIe bus, which would be very slow.)
> > 
> > From a Xen p2m this is all fine - Xen will never remove pages from the
> > p2m unless it's requested to.  So the pinning, while needed on the Linux
> > side, doesn't need to be propagated to Xen I would think.
> 
> If pinning were enough things would be simple, but sadly it’s not.
> 
> >> 2. It can grab the *current* location of the pages and register an
> >>    MMU notifier.  This works for GPU memory and file-backed memory.
> >>    However, when the invalidate_range function of this callback is called, the
> >>    driver *must* stop all further accesses to the pages.
> >>
> >>    The invalidate_range callback is not allowed to block for a long
> >>    period of time.  My understanding is that things like dirty page
> >>    writeback are blocked while the callback is in progress.  My
> >>    understanding is also that the callback is not allowed to fail.
> >>    I believe it can return a retryable error but I don’t think that
> >>    it is allowed to keep failing forever.
> >>
> >>    Linux’s grant table driver actually had a bug in this area, which
> >>    led to deadlocks.  I fixed that a while back.
> >>
> >> KVM implements the second option: it maps pages into the stage-2
> >> page tables (or shadow page tables, if that is chosen) and unmaps
> >> them when the invalidate_range callback is called.
> > 
> > I assume this map and unmap is done by the host as a result of some
> > guest action?
> 
> Unmapping can happen at any time for any or no reason.  Semantically,
> it would be correct to only map the pages in response to a p2m violation,
> but for performance it might be better to map the pages eagerly instead.

That's an implementation detail, you can certainly map the pages
eagerly, or even map multiple contiguous pages as a result of a single
p2m violation.

I would focus on making a functioning prototype first, performance
comes afterwards.
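
To illustrate mapping several contiguous pages in response to a single
p2m violation, here is a minimal Python sketch of how a handler could
pick the range.  pages_to_map and its window parameter are made up for
the example; they are not existing Xen interfaces.

```python
def pages_to_map(fault_gfn, region_start, region_len, window=16):
    """Return the gfns to map in response to one fault.

    The window containing fault_gfn is aligned down to a multiple of
    `window` and clipped to [region_start, region_start + region_len).
    With window=1 this degenerates to mapping only the faulting page.
    """
    region_end = region_start + region_len
    if not region_start <= fault_gfn < region_end:
        raise ValueError("fault outside the region")
    aligned = fault_gfn - (fault_gfn % window)
    first = max(region_start, aligned)
    end = min(region_end, aligned + window)
    return list(range(first, end))
```

Starting with window=1 keeps a first prototype simple, and the window
can be widened later without changing the caller.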

> >> Furthermore,
> >> if a page fault happens while the page is unmapped, KVM will try
> >> to bring the pages back into memory so the guest can access it.
> > 
> > You could likely handle this in Xen in the following way:
> > 
> >  - A device model will get p2m violations forwarded, as it's the same
> >    model that's used to handle emulation of device MMIO.  You will
> >    need to register an ioreq server to request those faults to be
> >    forwarded, I think the hardware domain kernel will handle those?
> > 
> >  - Allow ioreqs to signal to Xen that a guest operation must be
> >    retried.  IOW: resume guest execution without advancing the IP.
> > 
> > I think this last bit is the one that will require changes to Xen, so
> > that you can add a type of ioreq reply that implies a retry from the
> > guest context.
> >
> > I’m not actually sure if this is needed, though it would be nice.  It
> might be possible for Xen to instead emulate the current instruction and
> continue, with the ioreq server just returning the current value of the
> pages.

You can, indeed, but it's cumbersome?  You might have to map the page
in the context of the entity that implements the ioreq server to
access the data.  Allowing retries would be more generic, and would reduce
the code in the ioreq server handler, which would only map the page
to the guest p2m and request a retry.
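
As a rough sketch of the difference between the two reply types, the
following Python toy runs a list of guest accesses against a handler.
Reply, run, map_and_retry and emulate are hypothetical names for this
example only; the retry reply does not exist in Xen today.

```python
from enum import Enum, auto


class Reply(Enum):
    EMULATED = auto()  # handler performed the access; step past the instruction
    RETRY = auto()     # handler fixed the mapping; re-run the same instruction


def run(program, mapped, handler, max_steps=100):
    """Execute a list of gfn accesses; return a trace of (ip, how) tuples."""
    ip, trace = 0, []
    for _ in range(max_steps):
        if ip >= len(program):
            break
        gfn = program[ip]
        if gfn in mapped:
            trace.append((ip, "direct"))
            ip += 1
            continue
        # Unmapped: this models a p2m violation forwarded to the handler.
        reply = handler(gfn, mapped)
        if reply is Reply.RETRY:
            trace.append((ip, "retry"))  # the IP is not advanced
        else:
            trace.append((ip, "emulated"))
            ip += 1
    return trace


def map_and_retry(gfn, mapped):
    # The handler only maps the page; it never touches the data itself.
    mapped.add(gfn)
    return Reply.RETRY


def emulate(gfn, mapped):
    # The handler would have to read or write the page contents here.
    return Reply.EMULATED
```

With map_and_retry each page faults once and every later access is
direct; with emulate every access to the page is forwarded again.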

> What I’m more concerned about is being able to provide a page
> into the p2m so that the *next* access doesn’t fault, and being able
> to remove that page from the p2m so that the next access *does* fault.

Maybe I'm not getting the question right, all Xen modifications to the
p2m take immediate effect.  By the time a XEN_DOMCTL_memory_mapping
hypercall returns the operation would have taken effect.

> Are there any hypercalls that can be used for these operations right
> now?

With some trickery you could likely use XEN_DOMCTL_memory_mapping to
add and remove those pages.  You will need calls to
XEN_DOMCTL_iomem_permission beforehand so that you grant the receiving
domain permissions to access those (and of course the granting domain
needs to have full access to them).

This is not ideal if mapping RAM pages.  AFAICT there are no strict
checks that the added page is not RAM, but you will still need to
handle RAM pages as IOMEM and grant them using
XEN_DOMCTL_iomem_permission, which is not great.  Also note that this
is a domctl, so not stable.  It might however be enough for a
prototype.
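
The ordering described above (permission first, then mapping) can be
sketched with a toy model.  This is Python, not Xen code: ToyDomain is
hypothetical, its method names only echo the domctl names, and the real
hypercalls take different arguments and perform more checks (for
example on the calling domain's own access).

```python
class ToyDomain:
    """Hypothetical model: a set of permitted frames plus a gfn -> mfn map."""

    def __init__(self):
        self.iomem_caps = set()  # frames this domain is allowed to have mapped
        self.p2m = {}

    def iomem_permission(self, first_mfn, nr, allow):
        frames = range(first_mfn, first_mfn + nr)
        if allow:
            self.iomem_caps.update(frames)
        else:
            self.iomem_caps.difference_update(frames)

    def memory_mapping(self, first_gfn, first_mfn, nr, add):
        # Assumption of this model: the permission check applies to
        # both adding and removing a mapping.
        if any(first_mfn + i not in self.iomem_caps for i in range(nr)):
            raise PermissionError("frame range not permitted for this domain")
        for i in range(nr):
            if add:
                self.p2m[first_gfn + i] = first_mfn + i
            else:
                self.p2m.pop(first_gfn + i, None)
```

In this model a revoke is the mirror image of a map: remove the
mapping first, then drop the permission.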

Long term I think we want to expand XENMEM_add_to_physmap{,_batch} to
handle this use-case.

> If not, which Xen functions would one use to implement them?
> Some notes:
> 
> - The p2m might need to be made to point to a PCI BAR or system RAM.
>   The guest kernel and host userspace don’t know which, and in any
>   case don’t need to care.  The host kernel knows, but I don’t know
>   if the information is exposed to the Xen driver.

Hm, as said above, while you could possibly handle RAM as IOMEM, it
has the slight inconvenience of having to add such RAM pages to the
d->iomem_caps rangeset for XEN_DOMCTL_memory_mapping to succeed.

From a guest PoV, it doesn't matter if the underlying page is RAM or
MMIO, as long as it's mapped in the p2m.

> 
> - If the p2m needs to point to system RAM, the RAM will be memory
>   that belongs to the backend.
> 
> - If the p2m needs to point to a PCI BAR, it will initially need
>   to point to a real PCI device that is owned by the backend.

As long as you give the destination domain access to the page using
XEN_DOMCTL_iomem_permission prior to the XEN_DOMCTL_memory_mapping
call it should work.

How does this work for device DMA accesses?  If the device is assigned
to the backend domain (and thus using the backend domain IOMMU context
entry and page-tables) DMA accesses cannot be done against guest
provided addresses, there needs to be some kind of translation layer
that filters commands?

My initial recommendation would be to look into what you can do with
the existing XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping
hypercalls.

> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>   be atomic from the guest’s perspective.

Updates of p2m PTEs are always atomic.

Regards, Roger.



* Re: Mapping memory into a domain
  2025-05-08  7:52             ` Roger Pau Monné
@ 2025-05-09  4:52               ` Demi Marie Obenour
  2025-05-09  8:53                 ` Roger Pau Monné
  2025-05-09  9:47                 ` Alejandro Vallejo
  0 siblings, 2 replies; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-09  4:52 UTC (permalink / raw)
  To: Roger Pau Monné, Xenia Ragiadakou, Stefano Stabellini
  Cc: Alejandro Vallejo, Xen developer discussion, Andrew Cooper,
	Juergen Gross, Xen-devel



On 5/8/25 3:52 AM, Roger Pau Monné wrote:
> On Wed, May 07, 2025 at 08:36:07PM -0400, Demi Marie Obenour wrote:
>> On 5/7/25 1:39 PM, Roger Pau Monné wrote:
>>> On Tue, May 06, 2025 at 04:56:12PM -0400, Demi Marie Obenour wrote:
>>>> On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
>>>>> On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
>>>>>> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
>>>>>>> I suppose this is still about multiplexing the GPU driver the way we
>>>>>>> last discussed at Xen Summit?
>>>>>>>
>>>>>>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
>>>>>>>> What are the appropriate Xen internal functions for:
>>>>>>>>
>>>>>>>> 1. Turning a PFN into an MFN?
>>>>>>>> 2. Mapping an MFN into a guest?
>>>>>>>> 3. Unmapping that MFN from a guest?
>>>>>>>
>>>>>>> The p2m is the single source of truth about such mappings.
>>>>>>>
>>>>>>> This is all racy business. You want to keep the p2m lock for the full
>>>>>>> duration of whatever operation you wish to do, or you risk another CPU
>>>>>>> taking it and pulling the rug under your feet at the most inconvenient
>>>>>>> time.
>>>>>>>
>>>>>>> In general all this faff is hidden under way too many layers beneath
>>>>>>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
>>>>>>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
>>>>>>>
>>>>>>> Note that not every pfn has an associated mfn. Not even every valid pfn
>>>>>>> has necessarily an associated mfn (there's pod). And all of this is
>>>>>>> volatile business in the presence of a balloon driver or vPCI placing
>>>>>>> mmio windows over guest memory.
>>>>>>
>>>>>> Can I check that POD is not in use?  
>>>>>
>>>>> Maybe, but now you're reaching exponential complexity considering each
>>>>> individual knob of the p2m into account.
>>>>>
>>>>>>
>>>>>>> In general anything up this alley would need a cohesive pair for
>>>>>>> map/unmap and a credible plan for concurrency and how it's all handled
>>>>>>> in conjunction with other bits that touch the p2m.
>>>>>>
>>>>>> Is taking the p2m lock for the entire operation a reasonable approach
>>>>>> for concurrency?  Will this cause too much lock contention?
>>>>>
>>>>> Maybe. It'd be fine for a page. Likely not so for several GiB if they
>>>>> aren't already superpages.
>>>>>
>>>>>>
>>>>>>>> The first patch I am going to send with this information is a documentation
>>>>>>>> patch so that others do not need to figure this out for themselves.
>>>>>>>> I remember being unsure even after looking through the source code, which
>>>>>>>> is why I am asking here.
>>>>>>>
>>>>>>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
>>>>>>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
>>>>>>> such helpers don't exist and the general manipulations are hard to
>>>>>>> explain.
>>>>>>
>>>>>> Is this a task that is only suitable for someone who has several years
>>>>>> experience working on Xen, or is it something that would make sense for
>>>>>> someone who is less experienced?
>>>>>
>>>>> The p2m is a very complex beast that integrates more features than I
>>>>> care to count. It requires a lot of prior knowledge. Whoever does it
>>>>> must know Xen fairly well in many configurations.
>>>>>
>>>>> The real problem is finding the right primitives that do what you want
>>>>> without overcomplicating everything else, preserving system security
>>>>> invariants and have benign (and ideally clear) edge cases.
>>>>>
>>>>> This was the last email you sent (I think?). Has any of the requirements
>>>>> changed in any direction?
>>>>>
>>>>>   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/
>>>>
>>>> Map and Revoke are still needed, with the same requirements as described
>>>> in this email.  Steal and Return were needed for GPU shared virtual memory,
>>>> but it has been decided to not support this with virtio-GPU, so these
>>>> primitives are no longer needed.
>>>>
>>>>> Something I'm missing there is how everything works without Xen. That
>>>>> might help (me, at least) gauge what could prove enough to support the
>>>>> usecase. Are there sequence diagrams anywhere about how this whole thing
>>>>> works without Xen? I vaguely remember you showing something last year in
>>>>> Xen Summit in the design session, but my memory isn't that good :)
>>>
>>> Hello,
>>>
>>> Sorry, possibly replying a bit out of context here.
>>>
>>> Since I will mention this in several places: p2m is the second stage
>>> page-tables used by Xen for PVH and HVM guests.  A p2m violation is
>>> the equivalent of a page-fault for guest p2m accesses.
>>>
>>>> A Linux driver that needs access to userspace memory
>>>> pages can get it in two different ways:
>>>>
>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>>>    If these functions succeed, the driver is guaranteed to be able
>>>>    to access the pages until it unpins them.  However, this also
>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>>>    cross the PCIe bus, which would be very slow.)
>>>
>>> From a Xen p2m this is all fine - Xen will never remove pages from the
>>> p2m unless it's requested to.  So the pinning, while needed on the Linux
>>> side, doesn't need to be propagated to Xen I would think.
>>
>> If pinning were enough things would be simple, but sadly it’s not.
>>
>>>> 2. It can grab the *current* location of the pages and register an
>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>    However, when the invalidate_range function of this callback is called, the
>>>>    driver *must* stop all further accesses to the pages.
>>>>
>>>>    The invalidate_range callback is not allowed to block for a long
>>>>    period of time.  My understanding is that things like dirty page
>>>>    writeback are blocked while the callback is in progress.  My
>>>>    understanding is also that the callback is not allowed to fail.
>>>>    I believe it can return a retryable error but I don’t think that
>>>>    it is allowed to keep failing forever.
>>>>
>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>    led to deadlocks.  I fixed that a while back.
>>>>
>>>> KVM implements the second option: it maps pages into the stage-2
>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>> them when the invalidate_range callback is called.
>>>
>>> I assume this map and unmap is done by the host as a result of some
>>> guest action?
>>
>> Unmapping can happen at any time for any or no reason.  Semantically,
>> it would be correct to only map the pages in response to a p2m violation,
>> but for performance it might be better to map the pages eagerly instead.
> 
> That's an implementation detail, you can certainly map the pages
> eagerly, or even map multiple contiguous pages as a result of a single
> p2m violation.
> 
> I would focus on making a functioning prototype first, performance
> comes afterwards.

Makes sense.

>>>> Furthermore,
>>>> if a page fault happens while the page is unmapped, KVM will try
>>>> to bring the pages back into memory so the guest can access it.
>>>
>>> You could likely handle this in Xen in the following way:
>>>
>>>  - A device model will get p2m violations forwarded, as it's the same
>>>    model that's used to handle emulation of device MMIO.  You will
>>>    need to register an ioreq server to request those faults to be
>>>    forwarded, I think the hardware domain kernel will handle those?
>>>
>>>  - Allow ioreqs to signal to Xen that a guest operation must be
>>>    retried.  IOW: resume guest execution without advancing the IP.
>>>
>>> I think this last bit is the one that will require changes to Xen, so
>>> that you can add a type of ioreq reply that implies a retry from the
>>> guest context.
>>
>> I’m not actually sure if this is needed, though it would be nice.  It
>> might be possible for Xen to instead emulate the current instruction and
>> continue, with the ioreq server just returning the current value of the
>> pages.
> 
> You can, indeed, but it's cumbersome?  You might have to map the page
> in the context of the entity that implements the ioreq server to
> access the data.  Allowing retries would be more generic, and would reduce
> the code in the ioreq server handler, which would only map the page
> to the guest p2m and request a retry.

Yeah, it is cumbersome indeed.

>> What I’m more concerned about is being able to provide a page
>> into the p2m so that the *next* access doesn’t fault, and being able
>> to remove that page from the p2m so that the next access *does* fault.
> 
> Maybe I'm not getting the question right, all Xen modifications to the
> p2m take immediate effect.  By the time a XEN_DOMCTL_memory_mapping
> hypercall returns the operation would have taken effect.

Ah, that makes sense.  When revoking access, can XEN_DOMCTL_iomem_permission
and XEN_DOMCTL_memory_mapping fail even if the parameters are correct and
the caller has enough permissions, or will they always succeed?

>> Are there any hypercalls that can be used for these operations right
>> now?
> 
> With some trickery you could likely use XEN_DOMCTL_memory_mapping to
> add and remove those pages.  You will need calls to
> XEN_DOMCTL_iomem_permission beforehand so that you grant the receiving
> domain permissions to access those (and of course the granting domain
> needs to have full access to them).
> 
> This is not ideal if mapping RAM pages.  AFAICT there are no strict
> checks that the added page is not RAM, but you will still need to
> handle RAM pages as IOMEM and grant them using
> XEN_DOMCTL_iomem_permission, which is not great.  Also note that this
> is a domctl, so not stable.  It might however be enough for a
> prototype.

Unfortunately this won’t work if the backend is a PVH domain, as a PVH
domain doesn’t know its own MFNs.  It also won’t work for deprivileged
backends because XEN_DOMCTL_iomem_permission is subject to XSA-77.

> Long term I think we want to expand XENMEM_add_to_physmap{,_batch} to
> handle this use-case.

That would indeed be better.

>> If not, which Xen functions would one use to implement them?
>> Some notes:
>>
>> - The p2m might need to be made to point to a PCI BAR or system RAM.
>>   The guest kernel and host userspace don’t know which, and in any
>>   case don’t need to care.  The host kernel knows, but I don’t know
>>   if the information is exposed to the Xen driver.
> 
> Hm, as said above, while you could possibly handle RAM as IOMEM, it
> has the slight inconvenience of having to add such RAM pages to the
> d->iomem_caps rangeset for XEN_DOMCTL_memory_mapping to succeed.
> 
> From a guest PoV, it doesn't matter if the underlying page is RAM or
> MMIO, as long as it's mapped in the p2m.

Understood, thanks!

>> - If the p2m needs to point to system RAM, the RAM will be memory
>>   that belongs to the backend.
>>
>> - If the p2m needs to point to a PCI BAR, it will initially need
>>   to point to a real PCI device that is owned by the backend.
> 
> As long as you give the destination domain access to the page using
> XEN_DOMCTL_iomem_permission prior to the XEN_DOMCTL_memory_mapping
> call it should work.
> 
> How does this work for device DMA accesses?  If the device is assigned
> to the backend domain (and thus using the backend domain IOMMU context
> entry and page-tables) DMA accesses cannot be done against guest
> provided addresses, there needs to be some kind of translation layer
> that filters commands?

Thankfully, this is handled by the backend.

> My initial recommendation would be to look into what you can do with
> the existing XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping
> hypercalls.

I think this would be suitable for a prototype but not for production.

>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>>   be atomic from the guest’s perspective.
> 
> Updates of p2m PTEs are always atomic.

That’s good.

Xenia, would it be possible for AMD to post whatever has been
implemented so far?  I think this would help a lot, even if it
is incomplete.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Mapping memory into a domain
  2025-05-09  4:52               ` Demi Marie Obenour
@ 2025-05-09  8:53                 ` Roger Pau Monné
  2025-05-09  9:47                 ` Alejandro Vallejo
  1 sibling, 0 replies; 20+ messages in thread
From: Roger Pau Monné @ 2025-05-09  8:53 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Xenia Ragiadakou, Stefano Stabellini, Alejandro Vallejo,
	Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel

On Fri, May 09, 2025 at 12:52:28AM -0400, Demi Marie Obenour wrote:
> On 5/8/25 3:52 AM, Roger Pau Monné wrote:
> > On Wed, May 07, 2025 at 08:36:07PM -0400, Demi Marie Obenour wrote:
> >> On 5/7/25 1:39 PM, Roger Pau Monné wrote:
> >>> On Tue, May 06, 2025 at 04:56:12PM -0400, Demi Marie Obenour wrote:
> >>>> On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
> >>>>> On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
> >>>>>> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
> >>>>>>> I suppose this is still about multiplexing the GPU driver the way we
> >>>>>>> last discussed at Xen Summit?
> >>>>>>>
> >>>>>>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
> >>>>>>>> What are the appropriate Xen internal functions for:
> >>>>>>>>
> >>>>>>>> 1. Turning a PFN into an MFN?
> >>>>>>>> 2. Mapping an MFN into a guest?
> >>>>>>>> 3. Unmapping that MFN from a guest?
> >>>>>>>
> >>>>>>> The p2m is the single source of truth about such mappings.
> >>>>>>>
> >>>>>>> This is all racy business. You want to keep the p2m lock for the full
> >>>>>>> duration of whatever operation you wish to do, or you risk another CPU
> >>>>>>> taking it and pulling the rug under your feet at the most inconvenient
> >>>>>>> time.
> >>>>>>>
> >>>>>>> In general all this faff is hidden under way too many layers beneath
> >>>>>>> copy_{to,from}_guest(). Other p2m manipulation high-level constructs
> >>>>>>> that might do interesting things worth looking at may be {map,unmap}_mmio_region()
> >>>>>>>
> >>>>>>> Note that not every pfn has an associated mfn. Not even every valid pfn
> >>>>>>> has necessarily an associated mfn (there's pod). And all of this is
> >>>>>>> volatile business in the presence of a balloon driver or vPCI placing
> >>>>>>> mmio windows over guest memory.
> >>>>>>
> >>>>>> Can I check that POD is not in use?  
> >>>>>
> >>>>> Maybe, but now you're reaching exponential complexity taking each
> >>>>> individual knob of the p2m into account.
> >>>>>
> >>>>>>
> >>>>>>> In general anything up this alley would need a cohesive pair for
> >>>>>>> map/unmap and a credible plan for concurrency and how it's all handled
> >>>>>>> in conjunction with other bits that touch the p2m.
> >>>>>>
> >>>>>> Is taking the p2m lock for the entire operation a reasonable approach
> >>>>>> for concurrency?  Will this cause too much lock contention?
> >>>>>
> >>>>> Maybe. It'd be fine for a page. Likely not so for several GiB if they
> >>>>> aren't already superpages.
> >>>>>
> >>>>>>
> >>>>>>>> The first patch I am going to send with this information is a documentation
> >>>>>>>> patch so that others do not need to figure this out for themselves.
> >>>>>>>> I remember being unsure even after looking through the source code, which
> >>>>>>>> is why I am asking here.
> >>>>>>>
> >>>>>>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
> >>>>>>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
> >>>>>>> such helpers don't exist and the general manipulations are hard to
> >>>>>>> explain.
> >>>>>>
> >>>>>> Is this a task that is only suitable for someone who has several years
> >>>>>> experience working on Xen, or is it something that would make sense for
> >>>>>> someone who is less experienced?
> >>>>>
> >>>>> The p2m is a very complex beast that integrates more features than I
> >>>>> care to count. It requires a lot of prior knowledge. Whoever does it
> >>>>> must know Xen fairly well in many configurations.
> >>>>>
> >>>>> The real problem is finding the right primitives that do what you want
> >>>>> without overcomplicating everything else, preserving system security
> >>>>> invariants and have benign (and ideally clear) edge cases.
> >>>>>
> >>>>> This was the last email you sent (I think?). Has any of the requirements
> >>>>> changed in any direction?
> >>>>>
> >>>>>   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/
> >>>>
> >>>> Map and Revoke are still needed, with the same requirements as described
> >>>> in this email.  Steal and Return were needed for GPU shared virtual memory,
> >>>> but it has been decided to not support this with virtio-GPU, so these
> >>>> primitives are no longer needed.
> >>>>
> >>>>> Something I'm missing there is how everything works without Xen. That
> >>>>> might help (me, at least) gauge what could prove enough to support the
> >>>>> usecase. Are there sequence diagrams anywhere about how this whole thing
> >>>>> works without Xen? I vaguely remember you showing something last year in
> >>>>> Xen Summit in the design session, but my memory isn't that good :)
> >>>
> >>> Hello,
> >>>
> >>> Sorry, possibly replying a bit out of context here.
> >>>
> >>> Since I will mention this in several places: p2m is the second stage
> >>> page-tables used by Xen for PVH and HVM guests.  A p2m violation is
> >>> the equivalent of a page-fault for guest p2m accesses.
> >>>
> >>>> A Linux driver that needs access to userspace memory
> >>>> pages can get it in two different ways:
> >>>>
> >>>> 1. It can pin the pages using the pin_user_pages family of APIs.
> >>>>    If these functions succeed, the driver is guaranteed to be able
> >>>>    to access the pages until it unpins them.  However, this also
> >>>>    means that the pages cannot be paged out or migrated.  Furthermore,
> >>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
> >>>>    isn’t supported.  (At a minimum, it would prevent the pages from
> >>>>    migrating from system RAM to VRAM, so all access by a dGPU would
> >>>>    cross the PCIe bus, which would be very slow.)
> >>>
> >>> From a Xen p2m this is all fine - Xen will never remove pages from the
> >>> p2m unless it's requested to.  So the pinning, while needed on the Linux
> >>> side, doesn't need to be propagated to Xen I would think.
> >>
> >> If pinning were enough things would be simple, but sadly it’s not.
> >>
> >>>> 2. It can grab the *current* location of the pages and register an
> >>>>    MMU notifier.  This works for GPU memory and file-backed memory.
> >>>>    However, when the invalidate_range function of this callback is called, the
> >>>>    driver *must* stop all further accesses to the pages.
> >>>>
> >>>>    The invalidate_range callback is not allowed to block for a long
> >>>>    period of time.  My understanding is that things like dirty page
> >>>>    writeback are blocked while the callback is in progress.  My
> >>>>    understanding is also that the callback is not allowed to fail.
> >>>>    I believe it can return a retryable error but I don’t think that
> >>>>    it is allowed to keep failing forever.
> >>>>
> >>>>    Linux’s grant table driver actually had a bug in this area, which
> >>>>    led to deadlocks.  I fixed that a while back.
> >>>>
> >>>> KVM implements the second option: it maps pages into the stage-2
> >>>> page tables (or shadow page tables, if that is chosen) and unmaps
> >>>> them when the invalidate_range callback is called.
> >>>
> >>> I assume this map and unmap is done by the host as a result of some
> >>> guest action?
> >>
> >> Unmapping can happen at any time for any or no reason.  Semantically,
> >> it would be correct to only map the pages in response to a p2m violation,
> >> but for performance it might be better to map the pages eagerly instead.
> > 
> > That's an implementation detail, you can certainly map the pages
> > eagerly, or even map multiple contiguous pages as a result of a single
> > p2m violation.
> > 
> > I would focus on making a functioning prototype first, performance
> > comes afterwards.
> 
> Makes sense.
> 
> >>>> Furthermore,
> >>>> if a page fault happens while the page is unmapped, KVM will try
> >>>> to bring the pages back into memory so the guest can access it.
> >>>
> >>> You could likely handle this in Xen in the following way:
> >>>
> >>>  - A device model will get p2m violations forwarded, as it's the same
> >>>    model that's used to handle emulation of device MMIO.  You will
> >>>    need to register an ioreq server to request those faults to be
> >>>    forwarded, I think the hardware domain kernel will handle those?
> >>>
> >>>  - Allow ioreqs to signal to Xen that a guest operation must be
> >>>    retried.  IOW: resume guest execution without advancing the IP.
> >>>
> >>> I think this last bit is the one that will require changes to Xen, so
> >>> that you can add a type of ioreq reply that implies a retry from the
> >>> guest context.
> >> I’m not actually sure if this is needed, though it would be nice.  It
> >> might be possible for Xen to instead emulate the current instruction and
> >> continue, with the ioreq server just returning the current value of the
> >> pages.
> > 
> > You can, indeed, but it's cumbersome?  You might have to map the page
> > in the context of the entity that implements the ioreq server to
> > access the data.  Allowing retries would be more generic, and reduce
> > the code in the ioreq server handler, that would only map the page
> > to the guest p2m and request a retry.
> 
> Yeah, it is cumbersome indeed.
> 
> >> What I’m more concerned about is being able to provide a page
> >> into the p2m so that the *next* access doesn’t fault, and being able
> >> to remove that page from the p2m so that the next access *does* fault.
> > 
> > Maybe I'm not getting the question right, all Xen modifications to the
> > p2m take immediate effect.  By the time a XEN_DOMCTL_memory_mapping
> > hypercall returns the operation would have taken effect.
> 
> Ah, that makes sense.  When revoking access, can XEN_DOMCTL_iomem_permission
> and XEN_DOMCTL_memory_mapping fail even if the parameters are correct and
> the caller has enough permissions, or will they always succeed?

They can fail, but not for a guest induced reason.

For example XEN_DOMCTL_iomem_permission manipulates a rangeset and
revoking access might require a range to be split, and hence memory
allocated.  That allocation of memory can fail, but that's not under
guest control.
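
To make the failure mode concrete, here is a toy model of punching a
hole in the middle of a range.  It is only a sketch under assumptions:
struct range, toy_remove_range() and simulate_oom are made-up names for
illustration and this is not the rangeset code in Xen.  The point it
shows is that revoking the middle of a range needs one extra node, so
the operation can return -ENOMEM for reasons unrelated to the guest.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy rangeset: a sorted singly linked list of inclusive ranges. */
struct range {
    unsigned long s, e;
    struct range *next;
};

/* Set to nonzero to simulate the allocator running out of memory. */
static int simulate_oom;

static struct range *alloc_range(unsigned long s, unsigned long e)
{
    struct range *r = simulate_oom ? NULL : malloc(sizeof(*r));

    if (r) {
        r->s = s;
        r->e = e;
        r->next = NULL;
    }
    return r;
}

/* Remove [s, e] from the set.  Returns 0 on success or -ENOMEM. */
static int toy_remove_range(struct range **head,
                            unsigned long s, unsigned long e)
{
    struct range **pp = head;

    assert(s <= e);
    while (*pp) {
        struct range *r = *pp;

        if (r->e < s || r->s > e) {          /* no overlap */
            pp = &r->next;
        } else if (r->s < s && r->e > e) {   /* hole strictly inside */
            struct range *tail = alloc_range(e + 1, r->e);

            if (!tail)
                return -ENOMEM;              /* set is left unchanged */
            tail->next = r->next;
            r->next = tail;
            r->e = s - 1;
            return 0;
        } else if (r->s < s) {               /* trim the end of r */
            r->e = s - 1;
            pp = &r->next;
        } else if (r->e > e) {               /* trim the start of r */
            r->s = e + 1;
            pp = &r->next;
        } else {                             /* r fully covered */
            *pp = r->next;
            free(r);
        }
    }
    return 0;
}
```

Trimming either end of a range, or dropping a whole range, never
allocates in this model; only the split case does.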

> >> Are there any hypercalls that can be used for these operations right
> >> now?
> > 
> > With some trickery you could likely use XEN_DOMCTL_memory_mapping to
> > add and remove those pages.  You will need calls to
> > XEN_DOMCTL_iomem_permission beforehand so that you grant the receiving
> > domain permissions to access those (and of course the granting domain
> > needs to have full access to them).
> > 
> > This is not ideal if mapping RAM pages.  AFAICT there are no strict
> > checks that the added page is not RAM, but you will still need to
> > handle RAM pages as IOMEM and grant them using
> > XEN_DOMCTL_iomem_permission, which is not great.  Also note that this
> > is a domctl, so not stable.  It might however be enough for a
> > prototype.
> 
> Unfortunately this won’t work if the backend is a PVH domain, as a PVH
> domain doesn’t know its own MFNs.  It also won’t work for deprivileged
> backends because XEN_DOMCTL_iomem_permission is subject to XSA-77.

Hm, I think solving this will be complicated using a single hypercall,
because you have to deal with both MMIO and RAM, which are
traditionally handled differently in Xen, also when mapped in the
p2m.

You could possibly use XENMEM_add_to_physmap_batch to create a foreign
mapping in a remote guest p2m when mapping RAM, and
XEN_DOMCTL_memory_mapping when mapping IOMEM.  But that requires the
emulator/mediator to know whether it's attempting to map RAM or IOMEM
(which I think you wanted to avoid?)

Otherwise a new XENMEM_add_to_physmap{,_batch} `phys_map_space` option
needs to be added to cater for your requirements.
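
Roughly, the two-path approach would look like the sketch below.  All
names here (struct map_ops, map_page(), the stub_* functions) are
hypothetical and only for illustration: the three callbacks stand in
for whatever wrappers the mediator has around
XENMEM_add_to_physmap_batch, XEN_DOMCTL_iomem_permission and
XEN_DOMCTL_memory_mapping, and the stubs just record the call order.
Note that for the RAM path I am assuming the source is named by a
backend gfn rather than an mfn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum backing { BACKING_RAM, BACKING_IOMEM };

/* Hypothetical wrappers around the three existing hypercalls. */
struct map_ops {
    int (*map_foreign_ram)(uint32_t domid, uint64_t gfn, uint64_t src_gfn);
    int (*iomem_permission)(uint32_t domid, uint64_t mfn, bool allow);
    int (*memory_mapping)(uint32_t domid, uint64_t gfn, uint64_t mfn, bool add);
};

/* The caller has to say which kind of page it is: that is the drawback. */
static int map_page(const struct map_ops *ops, uint32_t domid,
                    uint64_t gfn, uint64_t src, enum backing kind)
{
    int rc;

    assert(ops != NULL);
    if (kind == BACKING_RAM)
        return ops->map_foreign_ram(domid, gfn, src);

    /* IOMEM: grant permission first, then map, undo the grant on failure. */
    rc = ops->iomem_permission(domid, src, true);
    if (rc)
        return rc;
    rc = ops->memory_mapping(domid, gfn, src, true);
    if (rc)
        ops->iomem_permission(domid, src, false);
    return rc;
}

/* Recording stubs so the call order can be inspected without Xen. */
static char trace[32];
static size_t trace_len;
static int fail_mapping;   /* nonzero makes stub_mapping() fail */

static void trace_reset(void)
{
    memset(trace, 0, sizeof(trace));
    trace_len = 0;
}

static void note(char c)
{
    if (trace_len + 1 < sizeof(trace))
        trace[trace_len++] = c;
}

static int stub_foreign(uint32_t d, uint64_t gfn, uint64_t src)
{
    (void)d; (void)gfn; (void)src;
    note('F');
    return 0;
}

static int stub_perm(uint32_t d, uint64_t mfn, bool allow)
{
    (void)d; (void)mfn;
    note(allow ? 'P' : 'p');
    return 0;
}

static int stub_mapping(uint32_t d, uint64_t gfn, uint64_t mfn, bool add)
{
    (void)d; (void)gfn; (void)mfn; (void)add;
    note('M');
    return fail_mapping ? -1 : 0;
}

static const struct map_ops stub_ops = {
    .map_foreign_ram = stub_foreign,
    .iomem_permission = stub_perm,
    .memory_mapping = stub_mapping,
};
```

With the stubs, a RAM request leaves "F" in trace, an IOMEM request
leaves "PM", and a failed IOMEM mapping leaves "PMp" (grant, map,
revoke).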

> > Long term I think we want to expand XENMEM_add_to_physmap{,_batch} to
> > handle this use-case.
> 
> That would indeed be better.
> 
> >> If not, which Xen functions would one use to implement them?
> >> Some notes:
> >>
> >> - The p2m might need to be made to point to a PCI BAR or system RAM.
> >>   The guest kernel and host userspace don’t know which, and in any
> >>   case don’t need to care.  The host kernel knows, but I don’t know
> >>   if the information is exposed to the Xen driver.
> > 
> > Hm, as said above, while you could possibly handle RAM as IOMEM, it
> > has the slight inconvenience of having to add such RAM pages to the
> > d->iomem_caps rangeset for XEN_DOMCTL_memory_mapping to succeed.
> > 
> > From a guest PoV, it doesn't matter if the underlying page is RAM or
> > MMIO, as long as it's mapped in the p2m.
> 
> Understood, thanks!
> 
> >> - If the p2m needs to point to system RAM, the RAM will be memory
> >>   that belongs to the backend.
> >>
> >> - If the p2m needs to point to a PCI BAR, it will initially need
> >>   to point to a real PCI device that is owned by the backend.
> > 
> > As long as you give the destination domain access to the page using
> > XEN_DOMCTL_iomem_permission prior to the XEN_DOMCTL_memory_mapping
> > call it should work.
> > 
> > How does this work for device DMA accesses?  If the device is assigned
> > to the backend domain (and thus using the backend domain IOMMU context
> > entry and page-tables) DMA accesses cannot be done against guest
> > provided addresses, there needs to be some kind of translation layer
> > that filters commands?
> 
> Thankfully, this is handled by the backend.

Oh, I see.  So the device IOMMU context is always set to the hardware
domain one, and the emulator handles all the translation required?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Mapping memory into a domain
  2025-05-09  4:52               ` Demi Marie Obenour
  2025-05-09  8:53                 ` Roger Pau Monné
@ 2025-05-09  9:47                 ` Alejandro Vallejo
  2025-05-09 10:50                   ` Roger Pau Monné
                                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Alejandro Vallejo @ 2025-05-09  9:47 UTC (permalink / raw)
  To: Demi Marie Obenour, Roger Pau Monné, Xenia Ragiadakou,
	Stefano Stabellini
  Cc: Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel

>>>>> A Linux driver that needs access to userspace memory
>>>>> pages can get it in two different ways:
>>>>>
>>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>>>>    If these functions succeed, the driver is guaranteed to be able
>>>>>    to access the pages until it unpins them.  However, this also
>>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>>>>    cross the PCIe bus, which would be very slow.)
>>>>
>>>> From a Xen p2m this is all fine - Xen will never remove pages from the
>>>> p2m unless it's requested to.  So the pinning, while needed on the Linux
>>>> side, doesn't need to be propagated to Xen I would think.

It might still be helpful to have the concept of pinning to avoid them
being evicted for other reasons (ballooning?). I don't think it'd be
sane to allow returning to Xen a page that a domain ever shared with a
device.

re: being requested. Are there real promises from Xen to that effect? I
could make a hypervisor oversubscribing on memory that swaps non-IOVA
mem in and out to disk, moving it around all the time and it would be
compliant with the current behaviour AIUI, but it wouldn't work with
this scheme, because the mfn's would be off more often than not.

>>>
>>> If pinning were enough things would be simple, but sadly it’s not.
>>>
>>>>> 2. It can grab the *current* location of the pages and register an
>>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>>    However, when the invalidate_range function of this callback is called, the
>>>>>    driver *must* stop all further accesses to the pages.
>>>>>
>>>>>    The invalidate_range callback is not allowed to block for a long
>>>>>    period of time.  My understanding is that things like dirty page
>>>>>    writeback are blocked while the callback is in progress.  My
>>>>>    understanding is also that the callback is not allowed to fail.
>>>>>    I believe it can return a retryable error but I don’t think that
>>>>>    it is allowed to keep failing forever.
>>>>>
>>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>>    led to deadlocks.  I fixed that a while back.
>>>>>
>>>>> KVM implements the second option: it maps pages into the stage-2
>>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>>> them when the invalidate_range callback is called.

I'm still lost as to what is where, who initiates what and what the end
goal is. Is this about using userspace memory in dom0, and THEN sharing
that with guests for as long as it's live? And make enough magic so the
guests don't notice the transitionary period in which there may not be
any memory?

Or is this about using domU memory for the driver living in dom0?

Or is this about something else entirely?

For my own education. Is the following sequence diagram remotely accurate?

dom0                              domU
 |                                  |
 |---+                              |
 |   | use gfn3 in the driver       |
 |   | (mapped on user thread)      |
 |<--+                              |
 |                                  |
 |  map mfn(gfn3) in domU BAR       |
 |--------------------------------->|
 |                              +---|
 |              happily use BAR |   |
 |                              +-->|
 |---+                              |
 |   | mmu notifier for gfn3        |
 |   | (invalidate_range)           |
 |<--+                              |
 |                                  |
 |  unmap mfn(gfn3)                 |
 |--------------------------------->| <--- Plus some means of making guest
 |---+                          +---|      vCPUs pause on access.
 |   | reclaim gfn3    block on |   |
 |<--+                 access   |   |
 |                              |   |
 |---+                          |   |
 |   | use gfn7 in the driver   |   |
 |   | (mapped on user thread)  |   |
 |<--+                          |   |
 |                              |   |
 |  map mfn(gfn7) in domU BAR   |   |
 |------------------------------+-->| <--- Unpause blocked domU vCPUs
 |                                  |

>>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>>>   be atomic from the guest’s perspective.
>> 
>> Updates of p2m PTEs are always atomic.
> That’s good.

Updates to a single PTE are atomic, sure. But mapping/unmapping sizes
not congruent with a whole superpage size (e.g. 256 KiB, more than a
page, less than a superpage) wouldn't be, as far as the guest is
concerned.

But if my understanding above is correct maybe it doesn't matter? It
only needs to be atomic wrt the hypercall that requests it, so that the
gfn is never reused while the guest p2m still holds that mfn.
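
To spell out the distinction I mean, here is a toy model.  It is a
sketch under assumptions and not Xen's p2m code: toy_p2m, toy_make_pte()
and toy_unmap_some() are invented names, and the 64-entry table stands
for the 256 KiB example (64 pages of 4 KiB).  Each entry is written with
a single atomic store, but unmapping the range is a loop, so an observer
in the middle sees some entries gone and some still present.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_ENTRIES 64             /* 64 x 4 KiB = 256 KiB */
#define TOY_INVALID    UINT64_C(0)

static _Atomic uint64_t toy_p2m[TOY_NR_ENTRIES];

/* Made-up PTE layout: frame number shifted up, bit 0 = present. */
static uint64_t toy_make_pte(uint64_t mfn)
{
    return (mfn << 12) | 1;
}

/* A single entry update is one atomic store: never a torn value. */
static void toy_set_entry(size_t gfn, uint64_t pte)
{
    assert(gfn < TOY_NR_ENTRIES);
    atomic_store(&toy_p2m[gfn], pte);
}

static uint64_t toy_lookup(size_t gfn)
{
    assert(gfn < TOY_NR_ENTRIES);
    return atomic_load(&toy_p2m[gfn]);
}

/*
 * Unmap up to `budget` entries of [first, first + nr).  The budget lets
 * a caller stop part-way and look at the state another vCPU could see
 * while a full-range unmap is still in progress.
 */
static size_t toy_unmap_some(size_t first, size_t nr, size_t budget)
{
    size_t done = 0;

    while (done < nr && done < budget) {
        toy_set_entry(first + done, TOY_INVALID);
        done++;
    }
    return done;
}
```

After mapping all 64 entries and calling toy_unmap_some(0, 64, 32),
entry 0 reads as invalid while entry 32 still holds its old value: every
entry is individually consistent, the range as a whole is not.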

Cheers,
Alejandro


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Mapping memory into a domain
  2025-05-09  9:47                 ` Alejandro Vallejo
@ 2025-05-09 10:50                   ` Roger Pau Monné
  2025-05-09 18:21                     ` Demi Marie Obenour
  2025-05-09 18:14                   ` Demi Marie Obenour
  2025-05-09 18:30                   ` Demi Marie Obenour
  2 siblings, 1 reply; 20+ messages in thread
From: Roger Pau Monné @ 2025-05-09 10:50 UTC (permalink / raw)
  To: Alejandro Vallejo
  Cc: Demi Marie Obenour, Xenia Ragiadakou, Stefano Stabellini,
	Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel

On Fri, May 09, 2025 at 11:47:36AM +0200, Alejandro Vallejo wrote:
> >>>>> A Linux driver that needs access to userspace memory
> >>>>> pages can get it in two different ways:
> >>>>>
> >>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
> >>>>>    If these functions succeed, the driver is guaranteed to be able
> >>>>>    to access the pages until it unpins them.  However, this also
> >>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
> >>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
> >>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
> >>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
> >>>>>    cross the PCIe bus, which would be very slow.)
> >>>>
> >>>> From a Xen p2m this is all fine - Xen will never remove pages from the
> >>>> p2m unless it's requested to.  So the pinning, while needed on the Linux
> >>>> side, doesn't need to be propagated to Xen I would think.
> 
> It might still be helpful to have the concept of pinning to avoid them
> being evicted for other reasons (ballooning?). I don't think it'd be
> sane to allow returning to Xen a page that a domain ever shared with a
> device.

If mapped using the p2m_mmio_direct type in the p2m, a domain won't be
able to balloon them out.  It would also be misguided for a guest
kernel to attempt to balloon out memory that I presume will be inside
of a PCI device BAR from the guest point of view.

> re: being requested. Are there real promises from Xen to that effect? I
> could make a hypervisor oversubscribing on memory that swaps non-IOVA
> mem in and out to disk, moving it around all the time and it would be
> compliant with the current behaviour AIUI, but it wouldn't work with
> this scheme, because the mfn's would be off more often than not.

Even if Xen supported domain memory swapping, that could never be used
with domains that have devices attached, as it's not possible to fixup
the p2m on IOMMU fault and retry the access.

Not sure you could even move mfns around, as you would need an atomic
way to copy the previous page contents and set the PTE to point to the
new page.

Unless you want to get into a (IMO) complicated scheme where the
domain notifies the hypervisor which ranges are being used for device
DMA accesses (and thus requires guest kernel changes), I think
swapping of guest memory when there are assigned devices is a no-go.

Xen has (or had? as I have never actually seen it being used) a mechanism
to swap domain memory to a dom0 file (see tools/xenpaging.c).  However
more than one provider has mentioned to me that one feature they
particularly preferred of Xen over KVM is that it would never swap
guest memory.  Not sure if that's still the case, but some struggled
to prevent KVM from swapping guest memory, and got complaints of
slowness from their tenants.

For the purposes of getting a prototype I would suggest that you
assume p2m memory cannot be randomly swapped out, unless requested by
either the guest or the control domain.

> >>>
> >>> If pinning were enough things would be simple, but sadly it’s not.
> >>>
> >>>>> 2. It can grab the *current* location of the pages and register an
> >>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
> >>>>>    However, when the invalidate_range function of this callback is called, the
> >>>>>    driver *must* stop all further accesses to the pages.
> >>>>>
> >>>>>    The invalidate_range callback is not allowed to block for a long
> >>>>>    period of time.  My understanding is that things like dirty page
> >>>>>    writeback are blocked while the callback is in progress.  My
> >>>>>    understanding is also that the callback is not allowed to fail.
> >>>>>    I believe it can return a retryable error but I don’t think that
> >>>>>    it is allowed to keep failing forever.
> >>>>>
> >>>>>    Linux’s grant table driver actually had a bug in this area, which
> >>>>>    led to deadlocks.  I fixed that a while back.
> >>>>>
> >>>>> KVM implements the second option: it maps pages into the stage-2
> >>>>> page tables (or shadow page tables, if that is chosen) and unmaps
> >>>>> them when the invalidate_range callback is called.
> 
> I'm still lost as to what is where, who initiates what and what the end
> goal is. Is this about using userspace memory in dom0, and THEN sharing
> that with guests for as long as its live? And make enough magic so the
> guests don't notice the transitionary period in which there may not be
> any memory?
> 
> Or is this about using domU memory for the driver living in dom0?
> 
> Or is this about something else entirely?
> 
> For my own education. Is the following sequence diagram remotely accurate?
> 
> dom0                              domU
>  |                                  |
>  |---+                              |
>  |   | use gfn3 in the driver       |
>  |   | (mapped on user thread)      |
>  |<--+                              |
>  |                                  |
>  |  map mfn(gfn3) in domU BAR       |
>  |--------------------------------->|
>  |                              +---|
>  |              happily use BAR |   |
>  |                              +-->|
>  |---+                              |
>  |   | mmu notifier for gfn3        |
>  |   | (invalidate_range)           |
>  |<--+                              |
>  |                                  |
>  |  unmap mfn(gfn3)                 |
>  |--------------------------------->| <--- Plus some means of making guest
>  |---+                          +---|      vCPUs pause on access.
>  |   | reclaim gfn3    block on |   |
>  |<--+                 access   |   |
>  |                              |   |
>  |---+                          |   |
>  |   | use gfn7 in the driver   |   |
>  |   | (mapped on user thread)  |   |
>  |<--+                          |   |
>  |                              |   |
>  |  map mfn(gfn7) in domU BAR   |   |
>  |------------------------------+-->| <--- Unpause blocked domU vCPUs

The guest vCPU will already pause on access if there's a p2m
violation, until the ioreq has completed and the vCPU execution can
resume.  That's in control of the ioreq server that handles the
request.

I don't know about the dom0 user-space part, but that's possibly of no
concern for the implementation side in Xen?

My understanding of the actions needed from the Xen side is:

 1. Map either RAM owned by the hardware domain or an MMIO page into
    a domain p2m.
 2. Remove entries from a domain p2m.
 3. Handle p2m violations resulting from guest accesses, using 1. and
    force a guest access retry (or emulate the access).

1. Can possibly be done with XEN_DOMCTL_memory_mapping and
XENMEM_add_to_physmap_batch, but as I understood it it's not ideal.
Demi would like a way to use the same hypercall to map either RAM or
IOMEM into a domain p2m.

2. What hypercall to use depends on how the memory is mapped.

3. ioreq servers will already get requests for accesses to unmapped
regions they have registered for.  If the access is to be retried we
need to expand the ioreq interface a bit to handle this case.  Adding a
new ioreq state like STATE_IORESP_RETRY might be enough?  Maybe I'm
being naive though.
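
Something along these lines is what I have in mind, as a toy model
only.  The first four values mirror the existing STATE_IOREQ_* and
STATE_IORESP_READY constants in the public ioreq header;
TOY_IORESP_RETRY is the hypothetical new state and does not exist in
Xen today.  struct toy_vcpu and the toy_* functions are invented for
the sketch and skip everything a real implementation would need
(event channels, locking, the actual emulation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_ioreq_state {
    TOY_IOREQ_NONE      = 0,
    TOY_IOREQ_READY     = 1,
    TOY_IOREQ_INPROCESS = 2,
    TOY_IORESP_READY    = 3,
    TOY_IORESP_RETRY    = 4,   /* hypothetical: "re-execute the access" */
};

struct toy_vcpu {
    uint64_t ip;                /* guest instruction pointer */
    enum toy_ioreq_state state;
    bool paused;
};

/* Guest touched an unmapped gfn: post the request and pause the vCPU. */
static void toy_fault(struct toy_vcpu *v)
{
    assert(v != NULL);
    v->state = TOY_IOREQ_READY;
    v->paused = true;
}

/*
 * ioreq server side.  If it mapped a page into the p2m it asks for a
 * retry, otherwise it reports the access as emulated.
 */
static void toy_server_reply(struct toy_vcpu *v, bool mapped_page)
{
    assert(v != NULL);
    v->state = TOY_IOREQ_INPROCESS;
    v->state = mapped_page ? TOY_IORESP_RETRY : TOY_IORESP_READY;
}

/*
 * Hypervisor side completion.  Returns false if there is no response
 * yet, in which case the vCPU stays paused.
 */
static bool toy_complete(struct toy_vcpu *v, unsigned int insn_len)
{
    assert(v != NULL);
    if (v->state != TOY_IORESP_READY && v->state != TOY_IORESP_RETRY)
        return false;

    if (v->state == TOY_IORESP_READY)
        v->ip += insn_len;      /* emulated: step over the instruction */
    /* On retry the IP is left alone so the access runs again. */

    v->state = TOY_IOREQ_NONE;
    v->paused = false;
    return true;
}
```

The only behavioural difference between the two response states in
this model is whether the instruction pointer advances before the vCPU
is resumed.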

>  |                                  |
> 
> >>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
> >>>   be atomic from the guest’s perspective.
> >> 
> >> Updates of p2m PTEs are always atomic.
> > That’s good.
> 
> Updates to a single PTE are atomic, sure. But mapping/unmapping sizes
> not congruent with a whole superpage size (e.g. 256 KiB, more than a
> page, less than a superpage) wouldn't be, as far as the guest is
> concerned.

I assumed the question was about PTE updates, i.e. whether PTEs are
always consistent.

> But if my understanding above is correct maybe it doesn't matter? It
> only needs to be atomic wrt the hypercall that requests it, so that the
> gfn is never reused while the guest p2m still holds that mfn.

I think it only matters that the PTE is always consistent, either
mapped or unmapped (and thus generating an ioreq request on access when
unmapped).

Regards, Roger.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Mapping memory into a domain
  2025-05-09 10:50                   ` Roger Pau Monné
@ 2025-05-09 18:21                     ` Demi Marie Obenour
  2025-05-12  8:08                       ` Roger Pau Monné
  0 siblings, 1 reply; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-09 18:21 UTC (permalink / raw)
  To: Roger Pau Monné, Alejandro Vallejo
  Cc: Xenia Ragiadakou, Stefano Stabellini, Xen developer discussion,
	Andrew Cooper, Juergen Gross, Xen-devel


On 5/9/25 6:50 AM, Roger Pau Monné wrote:
> On Fri, May 09, 2025 at 11:47:36AM +0200, Alejandro Vallejo wrote:
>>>>>>> A Linux driver that needs access to userspace memory
>>>>>>> pages can get it in two different ways:
>>>>>>>
>>>>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>>>>>>    If these functions succeed, the driver is guaranteed to be able
>>>>>>>    to access the pages until it unpins them.  However, this also
>>>>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>>>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>>>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>>>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>>>>>>    cross the PCIe bus, which would be very slow.)
>>>>>>
>>>>>> From a Xen p2m this is all fine - Xen will never remove pages from the
>>>>>> p2m unless it's requested to.  So the pinning, while needed on the Linux
>>>>>> side, doesn't need to be propagated to Xen I would think.
>>
>> It might still be helpful to have the concept of pinning to avoid them
>> being evicted for other reasons (ballooning?). I don't think it'd be
>> sane to allow returning to Xen a page that a domain ever shared with a
>> device.
> 
> If mapped using the p2m_mmio_direct type in the p2m a domain won't be
> able to balloon them out.  It would also be misguided for a guest
> kernel to attempt to balloon out memory that I presume will be inside
> of a PCI device BAR from the guest point of view.

Indeed it will be inside a BAR.

>> re: being requested. Are there real promises from Xen to that effect? I
>> could make a hypervisor oversubscribing on memory that swaps non-IOVA
>> mem in and out to disk, moving it around all the time and it would be
>> compliant with the current behaviour AIUI, but it wouldn't work with
>> this scheme, because the mfn's would be off more often than not.
> 
> Even if Xen supported domain memory swapping, that could never be used
> with domains that have devices attached, as it's not possible to fixup
> the p2m on IOMMU fault and retry the access.
> 
> Not sure you could even move mfns around, as you would need an atomic
> way to copy the previous page contents and set the PTE to point to the
> new page.
> 
> Unless you want to get into a (IMO) complicated scheme where the
> domain notifies the hypervisor which ranges are being used for device
> DMA accesses (and thus requires guest kernel changes), I think
> swapping of guest memory when there are assigned devices is a no-go.
> 
> Xen has (or had? as I never actually seen it being used) a mechanism
> to swap domain memory to a dom0 file (see tools/xenpaging.c).  However
> more than one provider had mentioned to me that one feature they
> particularly preferred of Xen over KVM is that it would never swap
> guest memory.  Not sure if that's still the case, but some struggled
> to prevent KVM from swapping guest memory, and got complaints of
> slowness from their tenants.
> 
> For the purposes of getting a prototype I would suggest that you
> assume p2m memory cannot be randomly swapped out, unless requested by
> either the guest or the control domain.

The API being discussed here needs to support frontends that have
assigned PCI devices, but the pages should never be mapped into
the frontend domain’s IOMMU context.  If the frontend tries to
DMA into one of these pages it’s a frontend bug.

>>>>> If pinning were enough things would be simple, but sadly it’s not.
>>>>>
>>>>>>> 2. It can grab the *current* location of the pages and register an
>>>>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>>>>    However, when the invalidate_range function of this callback is called, the
>>>>>>>    driver *must* stop all further accesses to the pages.
>>>>>>>
>>>>>>>    The invalidate_range callback is not allowed to block for a long
>>>>>>>    period of time.  My understanding is that things like dirty page
>>>>>>>    writeback are blocked while the callback is in progress.  My
>>>>>>>    understanding is also that the callback is not allowed to fail.
>>>>>>>    I believe it can return a retryable error but I don’t think that
>>>>>>>    it is allowed to keep failing forever.
>>>>>>>
>>>>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>>>>    led to deadlocks.  I fixed that a while back.
>>>>>>>
>>>>>>> KVM implements the second option: it maps pages into the stage-2
>>>>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>>>>> them when the invalidate_range callback is called.
>>
>> I'm still lost as to what is where, who initiates what and what the end
>> goal is. Is this about using userspace memory in dom0, and THEN sharing
>> that with guests for as long as it's live? And make enough magic so the
>> guests don't notice the transitionary period in which there may not be
>> any memory?
>>
>> Or is this about using domU memory for the driver living in dom0?
>>
>> Or is this about something else entirely?
>>
>> For my own education. Is the following sequence diagram remotely accurate?
>>
>> dom0                              domU
>>  |                                  |
>>  |---+                              |
>>  |   | use gfn3 in the driver       |
>>  |   | (mapped on user thread)      |
>>  |<--+                              |
>>  |                                  |
>>  |  map mfn(gfn3) in domU BAR       |
>>  |--------------------------------->|
>>  |                              +---|
>>  |              happily use BAR |   |
>>  |                              +-->|
>>  |---+                              |
>>  |   | mmu notifier for gfn3        |
>>  |   | (invalidate_range)           |
>>  |<--+                              |
>>  |                                  |
>>  |  unmap mfn(gfn3)                 |
>>  |--------------------------------->| <--- Plus some means of making guest
>>  |---+                          +---|      vCPUs pause on access.
>>  |   | reclaim gfn3    block on |   |
>>  |<--+                 access   |   |
>>  |                              |   |
>>  |---+                          |   |
>>  |   | use gfn7 in the driver   |   |
>>  |   | (mapped on user thread)  |   |
>>  |<--+                          |   |
>>  |                              |   |
>>  |  map mfn(gfn7) in domU BAR   |   |
>>  |------------------------------+-->| <--- Unpause blocked domU vCPUs
> 
> The guest vCPU will already pause on access if there's a p2m
> violation, until the ioreq has completed and the vCPU execution can
> resume.  That's in control of the ioreq server that handles the
> request.
> 
> I don't know about the dom0 user-space part, but that's possibly of no
> concern for the implementation side in Xen?

I believe so, yes.

> My understanding of the actions needed from the Xen side is:
> 
>  1. Map either RAM owned by the hardware domain or an MMIO page into
>     a domain p2m.
>  2. Remove entries from a domain p2m.
>  3. Handle p2m violations resulting from guest accesses, using 1. and
>     force a guest access retry (or emulate the access).
> 
> 1. Can possibly be done with XEN_DOMCTL_memory_mapping and
> XENMEM_add_to_physmap_batch, but as I understood it it's not ideal.
> Demi would like a way to use the same hypercall to map either RAM or
> IOMEM into a domain p2m.

Indeed so, and also the backend domain might be a driver domain instead
of the hardware domain.  It needs to have privilege over the frontend,
but it should not need privilege over the whole system.

> 2. What hypercall to use depends on how the memory is mapped.
> 
> 3. ioreq servers will already get requests for accesses to unmapped
> regions they have registered for.  If the access is to be retried we
> need to expand ioreq interface a bit to handle this case.  Adding a
> new ioreq state like STATE_IORESP_RETRY might be enough?  Maybe I'm
> being naive though.

This is where an implementation in a real userspace emulator would
be very useful, to ensure that the API being implemented is actually
usable in practice.

>>>>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>>>>>   be atomic from the guest’s perspective.
>>>>
>>>> Updates of p2m PTEs are always atomic.
>>> That’s good.
>>
>> Updates to a single PTE are atomic, sure. But mapping/unmapping sizes
>> not congruent with a whole superpage size (i.e: 256 KiB, more than a
>> page, less than a superpage) wouldn't be, as far as the guest is
>> concerned.
> 
> I've assumed the question was towards PTE updates, as to whether
> PTE entries were always consistent.
> 
>> But if my understanding above is correct maybe it doesn't matter? It
>> only needs to be atomic wrt the hypercall that requests it, so that the
>> gfn is never reused while the guest p2m still holds that mfn.
> 
> I think it only matters that the PTE is always consistent, either
> mapped or unmapped (and thus generate an ioreq request on access when
> unmapped).
You are correct.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-09 18:21                     ` Demi Marie Obenour
@ 2025-05-12  8:08                       ` Roger Pau Monné
  2025-05-12 23:18                         ` Demi Marie Obenour
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Pau Monné @ 2025-05-12  8:08 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Alejandro Vallejo, Xenia Ragiadakou, Stefano Stabellini,
	Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel

On Fri, May 09, 2025 at 02:21:57PM -0400, Demi Marie Obenour wrote:
> On 5/9/25 6:50 AM, Roger Pau Monné wrote:
> > On Fri, May 09, 2025 at 11:47:36AM +0200, Alejandro Vallejo wrote:
> >>>>>>> A Linux driver that needs access to userspace memory
> >>>>>>> pages can get it in two different ways:
> >>>>>>>
> >>>>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
> >>>>>>>    If these functions succeed, the driver is guaranteed to be able
> >>>>>>>    to access the pages until it unpins them.  However, this also
> >>>>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
> >>>>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
> >>>>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
> >>>>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
> >>>>>>>    cross the PCIe bus, which would be very slow.)
> >>>>>>
> >>>>>> From a Xen p2m this is all fine - Xen will never remove pages from the
> >>>>>> p2m unless it's requested to.  So the pinning, while needed on the Linux
> >>>>>> side, doesn't need to be propagated to Xen I would think.
> >>
> >> It might still be helpful to have the concept of pinning to avoid them
> >> being evicted for other reasons (ballooning?). I don't think it'd be
> >> sane to allow returning to Xen a page that a domain ever shared with a
> >> device.
> > 
> > If mapped using the p2m_mmio_direct type in the p2m a domain won't be
> > able to balloon them out.  It would also be misguided for a guest
> > kernel to attempt to balloon out memory that I presume will be inside
> > of a PCI device BAR from the guest point of view.
> 
> Indeed it will be inside a BAR.
> 
> >> re: being requested. Are there real promises from Xen to that effect? I
> >> could make a hypervisor oversubscribing on memory that swaps non-IOVA
> >> mem in and out to disk, moving it around all the time and it would be
> >> compliant with the current behaviour AIUI, but it wouldn't work with
> >> this scheme, because the mfn's would be off more often than not.
> > 
> > Even if Xen supported domain memory swapping, that could never be used
> > with domains that have devices attached, as it's not possible to fixup
> > the p2m on IOMMU fault and retry the access.
> > 
> > Not sure you could even move mfns around, as you would need an atomic
> > way to copy the previous page contents and set the PTE to point to the
> > new page.
> > 
> > Unless you want to get into a (IMO) complicated scheme where the
> > domain notifies the hypervisor which ranges are being used for device
> > DMA accesses (and thus requires guest kernel changes), I think
> > swapping of guest memory when there are assigned devices is a no-go.
> > 
> > Xen has (or had? as I've never actually seen it being used) a mechanism
> > to swap domain memory to a dom0 file (see tools/xenpaging.c).  However
> > more than one provider had mentioned to me that one feature they
> > particularly preferred of Xen over KVM is that it would never swap
> > guest memory.  Not sure if that's still the case, but some struggled
> > to prevent KVM from swapping guest memory, and got complaints of
> > slowness from their tenants.
> > 
> > For the purposes of getting a prototype I would suggest that you
> > assume p2m memory cannot be randomly swapped out, unless requested by
> > either the guest or the control domain.
> 
> The API being discussed here needs to support frontends that have
> assigned PCI devices, but the pages should never be mapped into
> the frontend domain’s IOMMU context.  If the frontend tries to
> DMA into one of these pages it’s a frontend bug.

That's a detail I didn't get from your previous description.  If
memory is not to be added to the IOMMU page-tables you will need an
extra flag or similar to signal this, as by default all memory added
to a guest p2m is also added to the IOMMU page-tables.  And when using
shared page-tables between the IOMMU and the CPU there's no way to add
mappings to the CPU only.

Do you really need such mappings to be added only to the p2m, and not
the IOMMU page-tables?  I don't think the pages "should never be
mapped", but rather "don't need to be mapped" as the implementation
won't support DMA accesses (iow: "never" is too strong in this
context).  IMO it is fine if for an initial prototype the pages are
also added to the IOMMU page-tables, and later you can add a flag (or
a new hypercall) that strictly only adds pages to the p2m and not the
IOMMU page-tables; that is likely to also be a good performance
improvement.
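
As a rough illustration of the flag idea above, here is a toy model in C.
It is not Xen code: struct toy_domain, toy_map() and MAP_F_CPU_ONLY are
hypothetical names invented for this sketch, and the two arrays merely
stand in for the p2m and the IOMMU page-tables.  It only shows the
intended semantics: the default adds a page to both, the flag adds it to
the p2m alone, and the flag is refused when page-tables are shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_GFNS        16
#define INVALID_MFN    UINT64_MAX
#define MAP_F_CPU_ONLY (1u << 0)   /* hypothetical flag, not a Xen interface */

/* Toy stand-in for a domain: one table for the CPU, one for the IOMMU. */
struct toy_domain {
    uint64_t p2m[NR_GFNS];
    uint64_t iommu[NR_GFNS];
    bool shared_pt;                /* CPU and IOMMU share page-tables */
};

void toy_domain_init(struct toy_domain *d, bool shared_pt)
{
    assert(d);
    for (unsigned int i = 0; i < NR_GFNS; i++) {
        d->p2m[i] = INVALID_MFN;
        d->iommu[i] = INVALID_MFN;
    }
    d->shared_pt = shared_pt;
}

/* Returns 0 on success, -1 on failure. */
int toy_map(struct toy_domain *d, unsigned int gfn, uint64_t mfn,
            unsigned int flags)
{
    assert(d);
    if (gfn >= NR_GFNS || mfn == INVALID_MFN)
        return -1;
    /* With shared page-tables there is no way to map for the CPU only. */
    if ((flags & MAP_F_CPU_ONLY) && d->shared_pt)
        return -1;
    d->p2m[gfn] = mfn;
    if (!(flags & MAP_F_CPU_ONLY))
        d->iommu[gfn] = mfn;       /* default: also visible to devices */
    return 0;
}

/* Would a DMA access by an assigned device to this gfn be translated? */
bool toy_dma_ok(const struct toy_domain *d, unsigned int gfn)
{
    assert(d);
    return gfn < NR_GFNS && d->iommu[gfn] != INVALID_MFN;
}
```

In this model a page mapped with the flag is reachable by the CPU while
device DMA to it always fails, rather than failing only after an unmap.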

> >>>>> If pinning were enough things would be simple, but sadly it’s not.
> >>>>>
> >>>>>>> 2. It can grab the *current* location of the pages and register an
> >>>>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
> >>>>>>>    However, when the invalidate_range function of this callback is called, the
> >>>>>>>    driver *must* stop all further accesses to the pages.
> >>>>>>>
> >>>>>>>    The invalidate_range callback is not allowed to block for a long
> >>>>>>>    period of time.  My understanding is that things like dirty page
> >>>>>>>    writeback are blocked while the callback is in progress.  My
> >>>>>>>    understanding is also that the callback is not allowed to fail.
> >>>>>>>    I believe it can return a retryable error but I don’t think that
> >>>>>>>    it is allowed to keep failing forever.
> >>>>>>>
> >>>>>>>    Linux’s grant table driver actually had a bug in this area, which
> >>>>>>>    led to deadlocks.  I fixed that a while back.
> >>>>>>>
> >>>>>>> KVM implements the second option: it maps pages into the stage-2
> >>>>>>> page tables (or shadow page tables, if that is chosen) and unmaps
> >>>>>>> them when the invalidate_range callback is called.
> >>
> >> I'm still lost as to what is where, who initiates what and what the end
> >> goal is. Is this about using userspace memory in dom0, and THEN sharing
> >> that with guests for as long as it's live? And make enough magic so the
> >> guests don't notice the transitionary period in which there may not be
> >> any memory?
> >>
> >> Or is this about using domU memory for the driver living in dom0?
> >>
> >> Or is this about something else entirely?
> >>
> >> For my own education. Is the following sequence diagram remotely accurate?
> >>
> >> dom0                              domU
> >>  |                                  |
> >>  |---+                              |
> >>  |   | use gfn3 in the driver       |
> >>  |   | (mapped on user thread)      |
> >>  |<--+                              |
> >>  |                                  |
> >>  |  map mfn(gfn3) in domU BAR       |
> >>  |--------------------------------->|
> >>  |                              +---|
> >>  |              happily use BAR |   |
> >>  |                              +-->|
> >>  |---+                              |
> >>  |   | mmu notifier for gfn3        |
> >>  |   | (invalidate_range)           |
> >>  |<--+                              |
> >>  |                                  |
> >>  |  unmap mfn(gfn3)                 |
> >>  |--------------------------------->| <--- Plus some means of making guest
> >>  |---+                          +---|      vCPUs pause on access.
> >>  |   | reclaim gfn3    block on |   |
> >>  |<--+                 access   |   |
> >>  |                              |   |
> >>  |---+                          |   |
> >>  |   | use gfn7 in the driver   |   |
> >>  |   | (mapped on user thread)  |   |
> >>  |<--+                          |   |
> >>  |                              |   |
> >>  |  map mfn(gfn7) in domU BAR   |   |
> >>  |------------------------------+-->| <--- Unpause blocked domU vCPUs
> > 
> > The guest vCPU will already pause on access if there's a p2m
> > violation, until the ioreq has completed and the vCPU execution can
> > resume.  That's in control of the ioreq server that handles the
> > request.
> > 
> > I don't know about the dom0 user-space part, but that's possibly of no
> > concern for the implementation side in Xen?
> 
> I believe so, yes.
> 
> > My understanding of the actions needed from the Xen side is:
> > 
> >  1. Map either RAM owned by the hardware domain or an MMIO page into
> >     a domain p2m.
> >  2. Remove entries from a domain p2m.
> >  3. Handle p2m violations resulting from guest accesses, using 1. and
> >     force a guest access retry (or emulate the access).
> > 
> > 1. Can possibly be done with XEN_DOMCTL_memory_mapping and
> > XENMEM_add_to_physmap_batch, but as I understood it it's not ideal.
> > Demi would like a way to use the same hypercall to map either RAM or
> > IOMEM into a domain p2m.
> 
> Indeed so, and also the backend domain might be a driver domain instead
> of the hardware domain.  It needs to have privilege over the frontend,
> but it should not need privilege over the whole system.

This can all be arranged; I wouldn't get bogged down in these
details initially.

> > 2. What hypercall to use depends on how the memory is mapped.
> > 
> > 3. ioreq servers will already get requests for accesses to unmapped
> > regions they have registered for.  If the access is to be retried we
> > need to expand ioreq interface a bit to handle this case.  Adding a
> > new ioreq state like STATE_IORESP_RETRY might be enough?  Maybe I'm
> > being naive though.
> 
> This is where an implementation in a real userspace emulator would
> be very useful, to ensure that the API being implemented is actually
> usable in practice.

My suggestion for adding a "retry" response type to the ioreq
interface was so that your ioreq model implementation would be
simpler.  However if that's more hassle for you, I would initially map
and emulate the access, as that would also be correct and shouldn't
require any changes to the ioreq interface.  It can always be expanded
later to support a map and retry model.
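
To make the two models concrete, here is a toy sketch in C of an
emulator handling an access to a page that is currently unmapped.  None
of this is the real ioreq interface: struct toy_ioreq, toy_handle() and
TOY_RESP_RETRY are hypothetical, and TOY_RESP_RETRY only mirrors the
idea of a possible new "retry" response, which does not exist today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGES 8

/* TOY_RESP_RETRY models the proposed (not existing) retry response. */
enum toy_resp { TOY_RESP_READY, TOY_RESP_RETRY };

/* Toy backend state: one data word per page keeps the model small. */
struct toy_backend {
    bool mapped[TOY_PAGES];
    uint32_t data[TOY_PAGES];
};

struct toy_ioreq {
    unsigned int page;
    bool is_write;
    uint32_t value;                /* input for writes, output for reads */
};

enum toy_resp toy_handle(struct toy_backend *b, struct toy_ioreq *req,
                         bool can_retry)
{
    assert(b && req && req->page < TOY_PAGES);

    /* Both models start by re-establishing the mapping. */
    b->mapped[req->page] = true;

    /* Map and retry: the guest re-executes and hits the new mapping. */
    if (can_retry)
        return TOY_RESP_RETRY;

    /* Map and emulate: complete this one access on the guest's behalf. */
    if (req->is_write)
        b->data[req->page] = req->value;
    else
        req->value = b->data[req->page];
    return TOY_RESP_READY;
}
```

Either way the page ends up mapped; the only difference is whether the
emulator or the guest performs the faulting access.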

AFAICT, from the ongoing discussion above, the only uncertainty is
which hypercall(s) to use to map either MMIO or RAM into a guest p2m.
I wouldn't invest a huge amount of time into prototyping something
very complex, and rather get a very simple hypercall implemented that
fits your needs.  You could likely make a frankenhypercall based on the
implementations of XEN_DOMCTL_memory_mapping and
XENMEM_add_to_physmap, so that you can get a prototype working.
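
For illustration only, the shape of such a combined call could look
like the toy model below, where one request type covers both RAM and
MMIO.  The names (struct toy_map_req, toy_map_range(), the TOY_P2M_*
types) are made up for this sketch and do not correspond to either of
the existing hypercalls; the all-or-nothing behaviour is likewise an
assumption, not something taken from Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Toy p2m types; only loosely modelled on the real p2m type names. */
enum toy_p2m_type { TOY_P2M_INVALID = 0, TOY_P2M_RAM, TOY_P2M_MMIO_DIRECT };

struct toy_p2m_entry {
    uint64_t mfn;
    enum toy_p2m_type type;
};

/* One request layout for both kinds of memory (hypothetical). */
struct toy_map_req {
    enum toy_p2m_type type;        /* TOY_P2M_RAM or TOY_P2M_MMIO_DIRECT */
    uint64_t first_gfn;
    uint64_t first_mfn;
    uint64_t nr;
};

/* Maps the whole range or nothing.  Returns 0 on success, -1 on failure. */
int toy_map_range(struct toy_p2m_entry *p2m, uint64_t p2m_size,
                  const struct toy_map_req *req)
{
    assert(p2m && req);
    if (req->type == TOY_P2M_INVALID || req->nr == 0 ||
        req->first_gfn >= p2m_size || req->nr > p2m_size - req->first_gfn)
        return -1;

    /* Refuse to overwrite existing entries, so failure leaves no trace. */
    for (uint64_t i = 0; i < req->nr; i++)
        if (p2m[req->first_gfn + i].type != TOY_P2M_INVALID)
            return -1;

    for (uint64_t i = 0; i < req->nr; i++) {
        p2m[req->first_gfn + i].mfn = req->first_mfn + i;
        p2m[req->first_gfn + i].type = req->type;
    }
    return 0;
}

/* Removes entries regardless of how they were mapped. */
int toy_unmap_range(struct toy_p2m_entry *p2m, uint64_t p2m_size,
                    uint64_t first_gfn, uint64_t nr)
{
    assert(p2m);
    if (nr == 0 || first_gfn >= p2m_size || nr > p2m_size - first_gfn)
        return -1;
    for (uint64_t i = 0; i < nr; i++) {
        p2m[first_gfn + i].mfn = 0;
        p2m[first_gfn + i].type = TOY_P2M_INVALID;
    }
    return 0;
}
```

A prototype would of course also need the permission checks that the
two existing implementations perform, which this sketch leaves out.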

I think at this point it's important to get a functional prototype, so
that we know exactly the requirements of the interfaces that you need.
I wouldn't bother to design a detailed interface until we know
that such an interface is suitable for your goals, and to that end we
need a prototype with whatever you can glue together.

Regards, Roger.



* Re: Mapping memory into a domain
  2025-05-12  8:08                       ` Roger Pau Monné
@ 2025-05-12 23:18                         ` Demi Marie Obenour
  0 siblings, 0 replies; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-12 23:18 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Alejandro Vallejo, Xenia Ragiadakou, Stefano Stabellini,
	Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel


On 5/12/25 4:08 AM, Roger Pau Monné wrote:
> On Fri, May 09, 2025 at 02:21:57PM -0400, Demi Marie Obenour wrote:
>> On 5/9/25 6:50 AM, Roger Pau Monné wrote:
>>> On Fri, May 09, 2025 at 11:47:36AM +0200, Alejandro Vallejo wrote:
>>>>>>>>> A Linux driver that needs access to userspace memory
>>>>>>>>> pages can get it in two different ways:
>>>>>>>>>
>>>>>>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>>>>>>>>    If these functions succeed, the driver is guaranteed to be able
>>>>>>>>>    to access the pages until it unpins them.  However, this also
>>>>>>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>>>>>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>>>>>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>>>>>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>>>>>>>>    cross the PCIe bus, which would be very slow.)
>>>>>>>>
>>>>>>>> From a Xen p2m this is all fine - Xen will never remove pages from the
>>>>>>>> p2m unless it's requested to.  So the pinning, while needed on the Linux
>>>>>>>> side, doesn't need to be propagated to Xen I would think.
>>>>
>>>> It might still be helpful to have the concept of pinning to avoid them
>>>> being evicted for other reasons (ballooning?). I don't think it'd be
>>>> sane to allow returning to Xen a page that a domain ever shared with a
>>>> device.
>>>
>>> If mapped using the p2m_mmio_direct type in the p2m a domain won't be
>>> able to balloon them out.  It would also be misguided for a guest
>>> kernel to attempt to balloon out memory that I presume will be inside
>>> of a PCI device BAR from the guest point of view.
>>
>> Indeed it will be inside a BAR.
>>
>>>> re: being requested. Are there real promises from Xen to that effect? I
>>>> could make a hypervisor oversubscribing on memory that swaps non-IOVA
>>>> mem in and out to disk, moving it around all the time and it would be
>>>> compliant with the current behaviour AIUI, but it wouldn't work with
>>>> this scheme, because the mfn's would be off more often than not.
>>>
>>> Even if Xen supported domain memory swapping, that could never be used
>>> with domains that have devices attached, as it's not possible to fixup
>>> the p2m on IOMMU fault and retry the access.
>>>
>>> Not sure you could even move mfns around, as you would need an atomic
>>> way to copy the previous page contents and set the PTE to point to the
>>> new page.
>>>
>>> Unless you want to get into a (IMO) complicated scheme where the
>>> domain notifies the hypervisor which ranges are being used for device
>>> DMA accesses (and thus requires guest kernel changes), I think
>>> swapping of guest memory when there are assigned devices is a no-go.
>>>
>>> Xen has (or had? as I've never actually seen it being used) a mechanism
>>> to swap domain memory to a dom0 file (see tools/xenpaging.c).  However
>>> more than one provider had mentioned to me that one feature they
>>> particularly preferred of Xen over KVM is that it would never swap
>>> guest memory.  Not sure if that's still the case, but some struggled
>>> to prevent KVM from swapping guest memory, and got complaints of
>>> slowness from their tenants.
>>>
>>> For the purposes of getting a prototype I would suggest that you
>>> assume p2m memory cannot be randomly swapped out, unless requested by
>>> either the guest or the control domain.
>>
>> The API being discussed here needs to support frontends that have
>> assigned PCI devices, but the pages should never be mapped into
>> the frontend domain’s IOMMU context.  If the frontend tries to
>> DMA into one of these pages it’s a frontend bug.
> 
> That's a detail I didn't get from your previous description.  If
> memory is not to be added to the IOMMU page-tables you will need an
> extra flag or similar to signal this, as by default all memory added
> to a guest p2m is also added to the IOMMU page-tables.  And when using
> shared page-tables between the IOMMU and the CPU there's no way to add
> mappings to the CPU only.

I suspect that in practice, shared CPU/IOMMU page tables will need to be
disabled when this API is in use, as...

> Do you really need such mappings to be added only to the p2m, and not
> the IOMMU page-tables?  I don't think the pages "should never be
> mapped", but rather "don't need to be mapped" as the implementation
> won't support DMA accesses (iow: "never" is too strong in this
> context).  IMO it is fine if for an initial prototype the pages are
> also added to the IOMMU page-tables, and later you can add a flag (or
> a new hypercall) that strictly only adds pages to the p2m and not the
> IOMMU page-tables; that is likely to also be a good performance
> improvement.

...I doubt that an IOTLB flush from an MMU notifier that might be called
fairly frequently will be acceptable for anything but a prototype.
I don't have any benchmarks, though.  Also, having DMA operations succeed
or fail non-deterministically would be much harder to debug than for them
to always fail.

>>>>>>> If pinning were enough things would be simple, but sadly it’s not.
>>>>>>>
>>>>>>>>> 2. It can grab the *current* location of the pages and register an
>>>>>>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>>>>>>    However, when the invalidate_range function of this callback is called, the
>>>>>>>>>    driver *must* stop all further accesses to the pages.
>>>>>>>>>
>>>>>>>>>    The invalidate_range callback is not allowed to block for a long
>>>>>>>>>    period of time.  My understanding is that things like dirty page
>>>>>>>>>    writeback are blocked while the callback is in progress.  My
>>>>>>>>>    understanding is also that the callback is not allowed to fail.
>>>>>>>>>    I believe it can return a retryable error but I don’t think that
>>>>>>>>>    it is allowed to keep failing forever.
>>>>>>>>>
>>>>>>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>>>>>>    led to deadlocks.  I fixed that a while back.
>>>>>>>>>
>>>>>>>>> KVM implements the second option: it maps pages into the stage-2
>>>>>>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>>>>>>> them when the invalidate_range callback is called.
>>>>
>>>> I'm still lost as to what is where, who initiates what and what the end
>>>> goal is. Is this about using userspace memory in dom0, and THEN sharing
>>>> that with guests for as long as it's live? And make enough magic so the
>>>> guests don't notice the transitionary period in which there may not be
>>>> any memory?
>>>>
>>>> Or is this about using domU memory for the driver living in dom0?
>>>>
>>>> Or is this about something else entirely?
>>>>
>>>> For my own education. Is the following sequence diagram remotely accurate?
>>>>
>>>> dom0                              domU
>>>>  |                                  |
>>>>  |---+                              |
>>>>  |   | use gfn3 in the driver       |
>>>>  |   | (mapped on user thread)      |
>>>>  |<--+                              |
>>>>  |                                  |
>>>>  |  map mfn(gfn3) in domU BAR       |
>>>>  |--------------------------------->|
>>>>  |                              +---|
>>>>  |              happily use BAR |   |
>>>>  |                              +-->|
>>>>  |---+                              |
>>>>  |   | mmu notifier for gfn3        |
>>>>  |   | (invalidate_range)           |
>>>>  |<--+                              |
>>>>  |                                  |
>>>>  |  unmap mfn(gfn3)                 |
>>>>  |--------------------------------->| <--- Plus some means of making guest
>>>>  |---+                          +---|      vCPUs pause on access.
>>>>  |   | reclaim gfn3    block on |   |
>>>>  |<--+                 access   |   |
>>>>  |                              |   |
>>>>  |---+                          |   |
>>>>  |   | use gfn7 in the driver   |   |
>>>>  |   | (mapped on user thread)  |   |
>>>>  |<--+                          |   |
>>>>  |                              |   |
>>>>  |  map mfn(gfn7) in domU BAR   |   |
>>>>  |------------------------------+-->| <--- Unpause blocked domU vCPUs
>>>
>>> The guest vCPU will already pause on access if there's a p2m
>>> violation, until the ioreq has completed and the vCPU execution can
>>> resume.  That's in control of the ioreq server that handles the
>>> request.
>>>
>>> I don't know about the dom0 user-space part, but that's possibly of no
>>> concern for the implementation side in Xen?
>>
>> I believe so, yes.
>>
>>> My understanding of the actions needed from the Xen side is:
>>>
>>>  1. Map either RAM owned by the hardware domain or an MMIO page into
>>>     a domain p2m.
>>>  2. Remove entries from a domain p2m.
>>>  3. Handle p2m violations resulting from guest accesses, using 1. and
>>>     force a guest access retry (or emulate the access).
>>>
>>> 1. Can possibly be done with XEN_DOMCTL_memory_mapping and
>>> XENMEM_add_to_physmap_batch, but as I understood it it's not ideal.
>>> Demi would like a way to use the same hypercall to map either RAM or
>>> IOMEM into a domain p2m.
>>
>> Indeed so, and also the backend domain might be a driver domain instead
>> of the hardware domain.  It needs to have privilege over the frontend,
>> but it should not need privilege over the whole system.
> 
> This can all be arranged; I wouldn't get bogged down in these
> details initially.

That's good to know.

>>> 2. What hypercall to use depends on how the memory is mapped.
>>>
>>> 3. ioreq servers will already get requests for accesses to unmapped
>>> regions they have registered for.  If the access is to be retried we
>>> need to expand ioreq interface a bit to handle this case.  Adding a
>>> new ioreq state like STATE_IORESP_RETRY might be enough?  Maybe I'm
>>> being naive though.
>>
>> This is where an implementation in a real userspace emulator would
>> be very useful, to ensure that the API being implemented is actually
>> usable in practice.
> 
> My suggestion for adding a "retry" response type to the ioreq
> interface was so that your ioreq model implementation would be
> simpler.  However if that's more hassle for you, I would initially map
> and emulate the access, as that would also be correct and shouldn't
> require any changes to the ioreq interface.  It can always be expanded
> later to support a map and retry model.
> 
> AFAICT, from the ongoing discussion above, the only uncertainty is
> which hypercall(s) to use to map either MMIO or RAM into a guest p2m.
> I wouldn't invest a huge amount of time into prototyping something
> very complex, and rather get a very simple hypercall implemented that
> fits your needs.  You could likely make a frankenhypercall based on the
> implementations of XEN_DOMCTL_memory_mapping and
> XENMEM_add_to_physmap, so that you can get a prototype working.
> 
> I think at this point it's important to get a functional prototype, so
> that we know exactly the requirements of the interfaces that you need.
> I wouldn't bother to design a detailed interface until we know
> that such an interface is suitable for your goals, and to that end we
> need a prototype with whatever you can glue together.

I believe AMD needs this for their automotive use-case.  What is the
status of this code?  I believe it would be best to send whatever
code AMD has available right now.  Just mark it as RFC if it is
incomplete.  That allows getting upstream feedback sooner rather
than later.

Xenia, Alejandro, Stefano, would this be feasible?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-09  9:47                 ` Alejandro Vallejo
  2025-05-09 10:50                   ` Roger Pau Monné
@ 2025-05-09 18:14                   ` Demi Marie Obenour
  2025-05-09 18:30                   ` Demi Marie Obenour
  2 siblings, 0 replies; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-09 18:14 UTC (permalink / raw)
  To: Alejandro Vallejo, Roger Pau Monné, Xenia Ragiadakou,
	Stefano Stabellini
  Cc: Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel


On 5/9/25 5:47 AM, Alejandro Vallejo wrote:
>>>>>> 2. It can grab the *current* location of the pages and register an
>>>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>>>    However, when the invalidate_range function of this callback is called, the
>>>>>>    driver *must* stop all further accesses to the pages.
>>>>>>
>>>>>>    The invalidate_range callback is not allowed to block for a long
>>>>>>    period of time.  My understanding is that things like dirty page
>>>>>>    writeback are blocked while the callback is in progress.  My
>>>>>>    understanding is also that the callback is not allowed to fail.
>>>>>>    I believe it can return a retryable error but I don’t think that
>>>>>>    it is allowed to keep failing forever.
>>>>>>
>>>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>>>    led to deadlocks.  I fixed that a while back.
>>>>>>
>>>>>> KVM implements the second option: it maps pages into the stage-2
>>>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>>>> them when the invalidate_range callback is called.
> 
> I'm still lost as to what is where, who initiates what and what the end
> goal is. Is this about using userspace memory in dom0, and THEN sharing
> that with guests for as long as it's live? And make enough magic so the
> guests don't notice the transitionary period in which there may not be
> any memory?
> 
> Or is this about using domU memory for the driver living in dom0?
> 
> Or is this about something else entirely?
> 
> For my own education. Is the following sequence diagram remotely accurate?
> 
> dom0                              domU
>  |                                  |
>  |---+                              |
>  |   | use gfn3 in the driver       |
>  |   | (mapped on user thread)      |
>  |<--+                              |
>  |                                  |
>  |  map mfn(gfn3) in domU BAR       |
>  |--------------------------------->|
>  |                              +---|
>  |              happily use BAR |   |
>  |                              +-->|
>  |---+                              |
>  |   | mmu notifier for gfn3        |
>  |   | (invalidate_range)           |
>  |<--+                              |
>  |                                  |
>  |  unmap mfn(gfn3)                 |
>  |--------------------------------->| <--- Plus some means of making guest
>  |---+                          +---|      vCPUs pause on access.
>  |   | reclaim gfn3    block on |   |
>  |<--+                 access   |   |
>  |                              |   |
>  |---+                          |   |
>  |   | use gfn7 in the driver   |   |
>  |   | (mapped on user thread)  |   |
>  |<--+                          |   |
>  |                              |   |
>  |  map mfn(gfn7) in domU BAR   |   |
>  |------------------------------+-->| <--- Unpause blocked domU vCPUs
>  |                                  |

I believe this is accurate, yes.
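
To make the lifecycle in the diagram concrete, here is a minimal toy
model in Python.  It is not Xen or Linux code: ToyP2M, map_bar,
unmap_bar and guest_access are hypothetical names, and the small
integers merely stand in for mfn(gfn3) and mfn(gfn7).  It only shows
the ordering: an access to an unmapped BAR page pauses the vCPU rather
than failing, and the next map operation lets it retry.

```python
# Toy model of the sequence in the diagram; not Xen or Linux code.
# All names here are hypothetical and exist only for illustration.

class ToyP2M:
    """Tracks which frame, if any, currently backs a single guest BAR page."""

    def __init__(self):
        self.mfn = None      # None means "not currently mapped"
        self.blocked = []    # vCPUs paused while waiting for a mapping
        self.log = []        # (vcpu, mfn) for every access that completed

    def map_bar(self, mfn):
        """Backend installs a mapping, then every blocked vCPU retries."""
        self.mfn = mfn
        woken, self.blocked = self.blocked, []
        for vcpu in woken:
            self.guest_access(vcpu)

    def unmap_bar(self):
        """Backend's invalidate path: drop the mapping before reclaiming."""
        old, self.mfn = self.mfn, None
        return old

    def guest_access(self, vcpu):
        """Guest touches the BAR; it pauses instead of failing if unmapped."""
        if self.mfn is None:
            self.blocked.append(vcpu)
            return None
        self.log.append((vcpu, self.mfn))
        return self.mfn


p2m = ToyP2M()
p2m.map_bar(3)                 # map mfn(gfn3) in domU BAR
first = p2m.guest_access(0)    # happily use BAR
reclaimed = p2m.unmap_bar()    # invalidate_range: unmap mfn(gfn3)
stalled = p2m.guest_access(1)  # vCPU 1 blocks on access
p2m.map_bar(7)                 # map mfn(gfn7): unpause blocked vCPUs
```

In this sketch vCPU 1 never observes the reclaimed frame: its access
only completes once the new frame is in place.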

>>>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>>>>   be atomic from the guest’s perspective.
>>>
>>> Updates of p2m PTEs are always atomic.
>> That’s good.
> 
> Updates to a single PTE are atomic, sure. But mapping/unmapping sizes
> not congruent with a whole superpage size (e.g. 256 KiB, more than a
> page, less than a superpage) wouldn't be, as far as the guest is
> concerned.
> 
> But if my understanding above is correct maybe it doesn't matter? It
> only needs to be atomic wrt the hypercall that requests it, so that the
> gfn is never reused while the guest p2m still holds that mfn.

I believe you are correct.  The only requirement is that the guest behaves
correctly if its page faults race against what is happening in the backend
domain.
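
As a rough illustration of the point about ranges, the toy model below
updates a 256 KiB range one 4 KiB entry at a time.  A dict stands in
for the guest p2m, and update_range and its observer hook are
hypothetical, not Xen functions.  Each individual store is atomic, but
a racing vCPU can see the range half updated.

```python
# Toy model only: a dict stands in for the guest p2m, one entry per 4 KiB page.
# update_range and its observer hook are hypothetical, not Xen functions.

PAGE_KIB = 4


def update_range(p2m, first_gfn, size_kib, new_mfn_base, observer=None):
    """Rewrite one entry at a time; each store is atomic, the loop is not."""
    pages = size_kib // PAGE_KIB
    for i in range(pages):
        p2m[first_gfn + i] = None if new_mfn_base is None else new_mfn_base + i
        if observer is not None:
            observer(dict(p2m))  # What a racing guest vCPU could see right now
    return pages


p2m = {gfn: 1000 + gfn for gfn in range(64)}  # 64 pages = 256 KiB, all mapped
snapshots = []
count = update_range(p2m, 0, 256, None, snapshots.append)

# Snapshots in which some pages are already unmapped and others are not.
mixed = [
    s for s in snapshots
    if None in s.values() and any(v is not None for v in s.values())
]
```

The intermediate states are harmless as long as the backend reuses the
frames only after update_range has returned, which is the "atomic wrt
the hypercall" property discussed above.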
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Mapping memory into a domain
  2025-05-09  9:47                 ` Alejandro Vallejo
  2025-05-09 10:50                   ` Roger Pau Monné
  2025-05-09 18:14                   ` Demi Marie Obenour
@ 2025-05-09 18:30                   ` Demi Marie Obenour
  2 siblings, 0 replies; 20+ messages in thread
From: Demi Marie Obenour @ 2025-05-09 18:30 UTC (permalink / raw)
  To: Alejandro Vallejo, Roger Pau Monné, Xenia Ragiadakou,
	Stefano Stabellini
  Cc: Xen developer discussion, Andrew Cooper, Juergen Gross, Xen-devel


On 5/9/25 5:47 AM, Alejandro Vallejo wrote:
>>>>>> A Linux driver that needs access to userspace memory
>>>>>> pages can get it in two different ways:
>>>>>>
>>>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>>>>>    If these functions succeed, the driver is guaranteed to be able
>>>>>>    to access the pages until it unpins them.  However, this also
>>>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>>>>>    cross the PCIe bus, which would be very slow.)
>>>>>
>>>>> From a Xen p2m perspective this is all fine - Xen will never remove
>>>>> pages from the p2m unless it's requested to.  So the pinning, while
>>>>> needed on the Linux side, doesn't need to be propagated to Xen I would
>>>>> think.
> 
> It might still be helpful to have the concept of pinning to avoid them
> being evicted for other reasons (ballooning?). I don't think it'd be
> sane to allow returning to Xen a page that a domain ever shared with a
> device.

Memory mapped into a domain using this API must not be mapped into the
IOMMU contexts of any devices assigned to the frontend.  This ensures
that the backend unexpectedly reclaiming the memory doesn’t cause
problems, even if the frontend has an assigned PCI device.  If the
frontend tries to perform DMA from a different PCI device into memory
mapped into a domain using this API, it’s a frontend bug and the IOMMU
must block the access.
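
A minimal sketch of this rule, as a toy model in Python.  ToyDomain and
its methods are hypothetical and do not correspond to any Xen
interface.  The assumption is simply that ordinary RAM is visible to
both the CPU and assigned devices, while memory mapped through this API
is visible to the CPU only.

```python
# Toy model; ToyDomain and its methods are hypothetical, not Xen interfaces.

class ToyDomain:
    def __init__(self):
        self.p2m = {}    # gfn -> mfn as seen by the frontend's vCPUs
        self.iommu = {}  # gfn -> mfn as seen by devices assigned to the frontend

    def map_ram(self, gfn, mfn):
        """Ordinary guest RAM: visible to both the CPU and assigned devices."""
        self.p2m[gfn] = mfn
        self.iommu[gfn] = mfn

    def map_revocable(self, gfn, mfn):
        """Backend-owned memory: CPU-visible only, never in the IOMMU context."""
        self.p2m[gfn] = mfn
        self.iommu.pop(gfn, None)

    def cpu_access(self, gfn):
        return self.p2m[gfn]

    def device_dma(self, gfn):
        """DMA is translated by the IOMMU context alone."""
        if gfn not in self.iommu:
            raise PermissionError(f"IOMMU blocked DMA to gfn {gfn:#x}")
        return self.iommu[gfn]


frontend = ToyDomain()
frontend.map_ram(0x10, 0xAAA)
frontend.map_revocable(0x20, 0xBBB)

try:
    frontend.device_dma(0x20)
    dma_blocked = False
except PermissionError:
    dma_blocked = True
```

Because the revocable page never enters the device-visible table, the
backend can reclaim it without having to wait for in-flight DMA.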

>>>> If pinning were enough things would be simple, but sadly it’s not.
>>>>
>>>>>> 2. It can grab the *current* location of the pages and register an
>>>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>>>    However, when the invalidate_range function of this callback is
>>>>>>    called, the driver *must* stop all further accesses to the pages.
>>>>>>
>>>>>>    The invalidate_range callback is not allowed to block for a long
>>>>>>    period of time.  My understanding is that things like dirty page
>>>>>>    writeback are blocked while the callback is in progress.  My
>>>>>>    understanding is also that the callback is not allowed to fail.
>>>>>>    I believe it can return a retryable error but I don’t think that
>>>>>>    it is allowed to keep failing forever.
>>>>>>
>>>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>>>    led to deadlocks.  I fixed that a while back.
>>>>>>
>>>>>> KVM implements the second option: it maps pages into the stage-2
>>>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>>>> them when the invalidate_range callback is called.
> 
> I'm still lost as to what is where, who initiates what and what the end
> goal is. Is this about using userspace memory in dom0, and THEN sharing
> that with guests for as long as it's live? And make enough magic so the
> guests don't notice the transitionary period in which there may not be
> any memory?

It is exactly about this, except that the backend domain can be any domain
that is privileged over the frontend domain, rather than only dom0 or the
hardware domain.
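
The difference between the two access strategies quoted above (pinning
versus an MMU notifier) can be contrasted in a small toy model.  This
is a sketch under stated assumptions, not Linux code: ToyPage and its
methods are hypothetical, and migrate stands in for whatever the kernel
does when it wants to move a page.

```python
# Toy model; ToyPage and its methods are hypothetical, not Linux APIs.

class ToyPage:
    def __init__(self, location):
        self.location = location
        self.pin_count = 0
        self.notifiers = []

    def pin(self):
        """Option 1: the holder keeps access, but the page cannot move."""
        self.pin_count += 1

    def unpin(self):
        self.pin_count -= 1

    def register_notifier(self, callback):
        """Option 2: the holder is told before the page moves."""
        self.notifiers.append(callback)

    def migrate(self, new_location):
        if self.pin_count > 0:
            return False       # Pinned pages stay where they are
        for callback in self.notifiers:
            callback(self)     # Each holder must stop using the page now
        self.location = new_location
        return True


pinned = ToyPage("system RAM")
pinned.pin()
pinned_moved = pinned.migrate("VRAM")

frontend_map = {"bar": "system RAM"}  # Stand-in for the frontend's mapping
tracked = ToyPage("system RAM")
tracked.register_notifier(lambda page: frontend_map.pop("bar", None))
tracked_moved = tracked.migrate("VRAM")
```

With the notifier, the page is free to move to VRAM, at the cost of the
frontend's mapping disappearing until the backend maps the new location.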

> Or is this about using domU memory for the driver living in dom0?

Unfortunately no.  This would be a better fit for how Xen is designed,
but it isn’t compatible with Linux’s memory management requirements.

> Or is this about something else entirely?
> 
> For my own education. Is the following sequence diagram remotely accurate?
> 
> dom0                              domU
>  |                                  |
>  |---+                              |
>  |   | use gfn3 in the driver       |
>  |   | (mapped on user thread)      |
>  |<--+                              |
>  |                                  |
>  |  map mfn(gfn3) in domU BAR       |
>  |--------------------------------->|
>  |                              +---|
>  |              happily use BAR |   |
>  |                              +-->|
>  |---+                              |
>  |   | mmu notifier for gfn3        |
>  |   | (invalidate_range)           |
>  |<--+                              |
>  |                                  |
>  |  unmap mfn(gfn3)                 |
>  |--------------------------------->| <--- Plus some means of making guest
>  |---+                          +---|      vCPUs pause on access.
>  |   | reclaim gfn3    block on |   |
>  |<--+                 access   |   |
>  |                              |   |
>  |---+                          |   |
>  |   | use gfn7 in the driver   |   |
>  |   | (mapped on user thread)  |   |
>  |<--+                          |   |
>  |                              |   |
>  |  map mfn(gfn7) in domU BAR   |   |
>  |------------------------------+-->| <--- Unpause blocked domU vCPUs
>  |                                  |

This diagram is accurate.

>>>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>>>>   be atomic from the guest’s perspective.
>>>
>>> Updates of p2m PTEs are always atomic.
>> That’s good.
> 
> Updates to a single PTE are atomic, sure. But mapping/unmapping sizes
> not congruent with a whole superpage size (e.g. 256 KiB, more than a
> page, less than a superpage) wouldn't be, as far as the guest is
> concerned.
> 
> But if my understanding above is correct maybe it doesn't matter? It
> only needs to be atomic wrt the hypercall that requests it, so that the
> gfn is never reused while the guest p2m still holds that mfn.

Correct.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


end of thread, other threads:[~2025-05-12 23:18 UTC | newest]

Thread overview: 20+ messages
2025-05-04 22:51 Mapping memory into a domain Demi Marie Obenour
2025-05-04 22:56 ` Andrew Cooper
2025-05-04 23:24   ` Demi Marie Obenour
2025-05-05 14:20     ` Roger Pau Monné
2025-05-05 11:32 ` Alejandro Vallejo
2025-05-06  1:02   ` Demi Marie Obenour
2025-05-06 13:06     ` Alejandro Vallejo
2025-05-06 20:56       ` Demi Marie Obenour
2025-05-07 17:39         ` Roger Pau Monné
2025-05-08  0:36           ` Demi Marie Obenour
2025-05-08  7:52             ` Roger Pau Monné
2025-05-09  4:52               ` Demi Marie Obenour
2025-05-09  8:53                 ` Roger Pau Monné
2025-05-09  9:47                 ` Alejandro Vallejo
2025-05-09 10:50                   ` Roger Pau Monné
2025-05-09 18:21                     ` Demi Marie Obenour
2025-05-12  8:08                       ` Roger Pau Monné
2025-05-12 23:18                         ` Demi Marie Obenour
2025-05-09 18:14                   ` Demi Marie Obenour
2025-05-09 18:30                   ` Demi Marie Obenour
