- * Re: [RFC] Xen PV IOMMU interface draft B
  2015-06-12 16:43 [RFC] Xen PV IOMMU interface draft B Malcolm Crossley
@ 2015-06-16 13:19 ` Jan Beulich
  2015-06-16 14:47   ` Malcolm Crossley
  2015-06-17 12:48 ` Yu, Zhang
  2015-06-26 10:23 ` Xen PV IOMMU interface draft C Malcolm Crossley
  2 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2015-06-16 13:19 UTC
  To: Malcolm Crossley
  Cc: Kevin Tian, Yu C Zhang, Andrew Cooper, Paul Durrant, David Vrabel,
	xen-devel, Zhiyuan Lv
>>> On 12.06.15 at 18:43, <malcolm.crossley@citrix.com> wrote:
> IOMMUOP_query_caps
> ------------------
> 
> This subop queries the runtime capabilities of the PV-IOMMU interface for 
> the
> specific called domain. This subop uses `struct pv_iommu_op` directly.
"calling domain" perhaps?
> ----------------------------------------------------------------------------
> --
> Field          Purpose
> -----          
> ---------------------------------------------------------------
> `flags`        [out] This field details the IOMMUOP capabilities.
> 
> `status`       [out] Status of this op, op specific values listed below
> ----------------------------------------------------------------------------
> --
> 
> Defined bits for flags field:
> 
> ----------------------------------------------------------------------------
> --
> Name                        Bit                Definition
> ----                       ------     ----------------------------------
> IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or IOMMUOP_map_foreign
>                                       can be used for this domain
"this" (see also above) perhaps being the calling domain? In which
case I wonder how the "for" and IOMMUOP_map_foreign are
meant to fit together: I assume the flag is to indicate that mapping into
the (calling) domain is possible. Which then makes me wonder - what
use is the new hypercall when this flag isn't set?
> IOMMU_QUERY_map_all_gfns     1        IOMMUOP_map_page subop can map any MFN
>                                       not used by Xen
"gfns" or "MFN"?
> Defined values for map_page subop status field:
> 
> Value   Reason
> ------  
> ----------------------------------------------------------------------
> 0       subop successfully returned
> -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
> -EPERM  GFN could not be mapped because the GFN belongs to Xen.
> -EPERM  Domain is not a  domain and GFN does not belong to domain
"is not a hardware domain"? Also, I think we're pretty determined
for there to ever only be one, so perhaps it should be "the
hardware domain" here and elsewhere.
> IOMMUOP_unmap_page
> ------------------
> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
> 
> The subop usage of the "struct pv_iommu_op" and ``struct unmap_page` fields
> are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
Has it been determined that unmapping based on GFN is never
going to be needed, and that unmapping by BFN is the more
practical solution? The map counterpart doesn't seem to exclude
establishing multiple mappings for the same BFN, and hence the
inverse here would become kind of fuzzy in that case.
> IOMMUOP_map_foreign_page
> ----------------
> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
> 
> It is not valid to use domid representing the calling domain.
And what's the point of that? Was it considered to have only one
map/unmap pair, capable of mapping both local and foreign pages?
If so, what speaks against that?
> The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.
I didn't see anything explaining the significance of this (namely
the ioserver part, I think I can see the need for the domid), can
you explain the background here please?
> Every new M2B entry will take a reference to the MFN backing the GFN.
What happens when that GFN->MFN mapping changes?
> All the following conditions are required to be true for PV IOMMU 
> map_foreign
> subop to succeed:
> 
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. The domain is a hardware_domain and the following Xen IOMMU options are
>    NOT enabled: dom0-passthrough
Is there a way for the hardware domain to know it is running in
pass-through mode? Also, "the domain" is ambiguous here; I'm
sure you mean the invoking domain, not the one owning the page.
> This subop usage of the "struct pv_iommu_op" and ``struct map_foreign_page`
> fields are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `domid`        [in] The domain ID for which the gfn field applies
> 
> `ioserver`     [in] IOREQ server id associated with mapping
> 
> `bfn`          [in] Bus address frame number for gfn address
In the description above you speak of returning data in this field. Is
[in] really correct?
> Defined bits for flags field:
> 
> Name                         Bit                Definition
> ----                        -----      ----------------------------------
> IOMMUOP_readable              0        BFN IOMMU mapping is readable
> IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
> IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
>                                        swapped to scratch page
Scratch page? This needs some explanation.
> Reserved for future use      3-9       Reserved flag bits should be 0
> IOMMU_page_order            10-15      Returns maximum possible page order for
>                                        all other IOMMUOP subops
"Returns"? "other"?
> IOMMU_lookup_foreign_page
> ----------------
> This subop uses `struct lookup_foreign_page` part of the `struct 
> pv_iommu_op`.
> 
> If the BFN is specified as an input and parameter and there is no IOMMU 
> support
> for the calling domain then an error will be returned.
"input and parameter"? The field is marked [out] below, and I also
can't see a flag allowing this to be optional (all the more so as flags is
also [out]).
> It is the calling domain responsibility to ensure there are no conflicts
No races you mean?
> Each successful subop will add to the M2B if there was not an existing 
> identical
> M2B entry.
> 
> Every new M2B entry will take a reference to the MFN backing the GFN.
This is a lookup - why would it add something somewhere? Perhaps
this is just a copy-and-paste mistake? Or is the use of "lookup" here
misleading (as the last section of the document seems to suggest)?
> Defined bits for flags field:
> 
> Name                         Bit                Definition
> ----                        -----      ----------------------------------
> IOMMUOP_readable              0        Returned BFN IOMMU mapping is 
> readable
> IOMMUOP_writeable             1        Returned BFN IOMMU mapping is 
> writeable
> Reserved for future use      2-9       Reserved flag bits should be 0
Reserved flags will be returned as 0.
> Defined values for lookup_foreign_page subop status field:
> 
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EPERM       Calling domain does not have sufficient privilege over domid
> -ENOENT      There is no available BFN for provided GFN + domid combination
Throughout here there seems to be a mixture of references to
(GFN, domid) pairs and (GFN,domid,ioserver) triplets. Along
with there being some explanation missing, these need to be
made consistent to avoid confusion.
> IOMMUOP_unmap_foreign_page
> ----------------
> This subop uses `struct unmap_foreign_page` part of the `struct 
> pv_iommu_op`.
> 
> If there is no IOMMU support then the MFN is returned in the BFN field (that 
> is
> the only valid bus address for the GFN + domid combination).
Copy-and-paste mistake again (doesn't seem to apply to unmap)?
> If there is IOMMU support then the specified BFN is returned for the GFN + 
> domid
> combination
> 
> Each successful subop will add to the M2B if there was not an existing 
> identical
> M2B entry. The
> 
> Every new M2B entry will take a reference to the MFN backing the GFN.
And again?
> This subop usage of the "struct pv_iommu_op" and ``struct 
> unmap_foreign_page` fields
> are detailed below:
> 
> -----------------------------------------------------------------------
> Field          Purpose
> -----          --------------------------------------------------------
> `ioserver`      [in] IOREQ server id associated with mapping
> 
> `bfn`          [in] Bus address frame number for gfn address
> 
> `flags`        [out] Flags for signalling page order of unmap operation
[out]?
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -ENOENT      There is no mapped BFN + ioserver id combination to unmap
Yet another pair, the unique mapping properties of which don't
follow from anything said earlier.
> PV IOMMU interactions with self ballooning
> ==========================================
> 
> The guest should clear any IOMMU mappings it has of it's own pages before
> releasing a page back to Xen. It will need to add IOMMU mappings after
> repopulating a page with the populate_physmap hypercall.
> 
> This requires that IOMMU mappings get a writeable page type reference count 
> and
> that guests clear any IOMMU mappings before pinning page table pages.
I suppose this is only for aware PV guests. If so, perhaps this should
be made explicit.
> PV IOMMU interactions with grant map/unmap operations
> =====================================================
> 
> Grant map operations return a Physical device accessible address (BFN) if 
> the
> GNTMAP_device_map flag is set.  This operation currently returns the MFN for 
> PV
> guests which may conflict with the BFN address space the guest uses if PV 
> IOMMU
> map support is available to the guest.
> 
> This design proposes to allow the calling domain to control the BFN address 
> that
> a grant map operation uses.
> 
> This can be achieved by specifying that the dev_bus_addr in the
> gnttab_map_grant_ref structure is used an input parameter instead of the
> output parameter it is currently.
> 
> Only PAGE_SIZE aligned addresses are allowed for dev_bus_addr input 
> parameter.
> 
> The revised structure is shown below for convenience.
> 
>     struct gnttab_map_grant_ref {
>         /* IN parameters. */
>         uint64_t host_addr;
>         uint32_t flags;               /* GNTMAP_* */
>         grant_ref_t ref;
>         domid_t  dom;
>         /* OUT parameters. */
>         int16_t  status;              /* => enum grant_status */
>         grant_handle_t handle;
>         /* IN/OUT parameters */
>         uint64_t dev_bus_addr;
>     };
> 
> 
> The grant map operation would then behave similarly to the IOMMUOP_map_page
> subop for the creation of the IOMMU mapping.
> 
> The grant unmap operation would then behave similarly to the 
> IOMMUOP_unmap_page
> subop for the removal of the IOMMU mapping.
We're talking about mappings of foreign pages here - aren't these the
wrong IOMMUOPs then? And if so, where would the ioserver id come
from?
Jan
- * Re: [RFC] Xen PV IOMMU interface draft B
  2015-06-16 13:19 ` Jan Beulich
@ 2015-06-16 14:47   ` Malcolm Crossley
  2015-06-16 15:56     ` Jan Beulich
  0 siblings, 1 reply; 20+ messages in thread
From: Malcolm Crossley @ 2015-06-16 14:47 UTC
  To: Jan Beulich
  Cc: Kevin Tian, Yu C Zhang, Andrew Cooper, Paul Durrant, David Vrabel,
	xen-devel, Zhiyuan Lv
On 16/06/15 14:19, Jan Beulich wrote:
>>>> On 12.06.15 at 18:43, <malcolm.crossley@citrix.com> wrote:
>> IOMMUOP_query_caps
>> ------------------
>>
>> This subop queries the runtime capabilities of the PV-IOMMU interface for 
>> the
>> specific called domain. This subop uses `struct pv_iommu_op` directly.
> 
> "calling domain" perhaps?
> 
>> ----------------------------------------------------------------------------
>> --
>> Field          Purpose
>> -----          
>> ---------------------------------------------------------------
>> `flags`        [out] This field details the IOMMUOP capabilities.
>>
>> `status`       [out] Status of this op, op specific values listed below
>> ----------------------------------------------------------------------------
>> --
>>
>> Defined bits for flags field:
>>
>> ----------------------------------------------------------------------------
>> --
>> Name                        Bit                Definition
>> ----                       ------     ----------------------------------
>> IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or IOMMUOP_map_foreign
>>                                       can be used for this domain
> 
> "this" (see also above) perhaps being the calling domain? In which
> case I wonder how the "for" and IOMMUOP_map_foreign are
> meant to fit together: I assume the flag to indicate that mapping into
> the (calling) domain is possible. Which then makes me wonder - what
> use if the new hypercall when this flag isn't set?
"This" is the calling domain. The IOMMU_lookup_foreign should continue to work if
this flag is not set.
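A minimal sketch of how a caller might test the two capability bits from the query_caps table. The bit positions come from the draft; the 32-bit width of `flags` and the `sketch_*` helper names are assumptions made only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the draft's query_caps flags table. */
#define IOMMU_QUERY_map_cap       (1u << 0)
#define IOMMU_QUERY_map_all_gfns  (1u << 1)

/* Hypothetical helper: true if map_page/map_foreign_page may be used
 * by the calling domain. The uint32_t width is an assumption. */
bool sketch_can_map(uint32_t caps)
{
    return (caps & IOMMU_QUERY_map_cap) != 0;
}

/* Hypothetical helper: true if the relaxed "map all gfns" access
 * control is reported. */
bool sketch_can_map_all_gfns(uint32_t caps)
{
    return (caps & IOMMU_QUERY_map_all_gfns) != 0;
}
```

A caller that sees `sketch_can_map()` return false would avoid the map subops and rely only on the lookup path, which is described as still working in that case.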
> 
>> IOMMU_QUERY_map_all_gfns     1        IOMMUOP_map_page subop can map any MFN
>>                                       not used by Xen
> 
> "gfns" or "MFN"?
gfns. This is meant to apply to the hardware domain only; it's to allow the
same access control as dom0-relaxed mode allows currently.
> 
>> Defined values for map_page subop status field:
>>
>> Value   Reason
>> ------  
>> ----------------------------------------------------------------------
>> 0       subop successfully returned
>> -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
>> -EPERM  GFN could not be mapped because the GFN belongs to Xen.
>> -EPERM  Domain is not a  domain and GFN does not belong to domain
> 
> "is not a hardware domain"? Also, I think we're pretty determined
> for there to ever only be one, so perhaps it should be "the
> hardware domain" here and elsewhere.
That is a typo. It should say "is not the hardware domain".
I will correct the other occurrences of "a hardware domain" to "the
hardware domain".
> 
>> IOMMUOP_unmap_page
>> ------------------
>> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
>>
>> The subop usage of the "struct pv_iommu_op" and ``struct unmap_page` fields
>> are detailed below:
>>
>> --------------------------------------------------------------------
>> Field          Purpose
>> -----          -----------------------------------------------------
>> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
> 
> Has it been determined that unmapping based on GFN is never
> going to be needed, and that unmapping by BFN is the more
> practical solution? The map counterpart doesn't seem to exclude
> establishing multiple mappings for the same BFN, and hence the
> inverse here would become kind of fuzzy in that case.
There will be only one BFN to MFN mapping per domain; the map hypercall will
fail any attempt to map a BFN to more than one GFN. This is why the unmap
is based on the BFN. It is allowed to have multiple BFN mappings of the same
GFN, however.
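The rule can be illustrated with a small sketch. This is not the Xen implementation: the fixed-size table, the `sketch_*` names and the -EEXIST/-EINVAL error codes are assumptions made purely for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_BFNS 16   /* hypothetical, tiny per-domain BFN space */

struct sketch_bfn_entry {
    bool     valid;
    uint64_t gfn;
};

static struct sketch_bfn_entry sketch_bfn_table[SKETCH_NR_BFNS];

/* Map bfn -> gfn. A BFN that already has a mapping is refused, so a
 * BFN never refers to more than one GFN. */
int sketch_map_page(uint64_t bfn, uint64_t gfn)
{
    if (bfn >= SKETCH_NR_BFNS)
        return -EINVAL;             /* error code assumed for the sketch */
    if (sketch_bfn_table[bfn].valid)
        return -EEXIST;             /* error code assumed for the sketch */
    sketch_bfn_table[bfn].valid = true;
    sketch_bfn_table[bfn].gfn = gfn;
    return 0;
}

/* Unmap by BFN alone: since each BFN has at most one GFN, the BFN is
 * enough to identify the mapping. */
int sketch_unmap_page(uint64_t bfn)
{
    if (bfn >= SKETCH_NR_BFNS || !sketch_bfn_table[bfn].valid)
        return -ENOENT;
    sketch_bfn_table[bfn].valid = false;
    return 0;
}
```

Two different BFNs may carry the same GFN in this table, which matches the "multiple BFN mappings of the same GFN" case.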
> 
>> IOMMUOP_map_foreign_page
>> ----------------
>> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
>>
>> It is not valid to use domid representing the calling domain.
> 
> And what's the point of that? Was it considered to have only one
> map/unmap pair, capable of mapping both local and foreign pages?
> If so, what speaks against that?
It was considered to have one map/unmap pair. The foreign map operation is the
more complex of the two types of mappings and so I thought it would make for a
cleaner API to have separate subops for each type of mapping. The handling of
M2B in particular is what may make the internal implementation complex.
> 
>> The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.
> 
> I didn't see anything explaining the significance of this (namely
> the ioserver part, I think I can see the need for the domid), can
> you explain the backgound here please?
The ioserver part of the tuple is mainly for supporting a notification
mechanism when a guest balloons out a GFN. The
> 
>> Every new M2B entry will take a reference to the MFN backing the GFN.
> 
> What happens when that GFN->MFN mapping changes?
The IOREQ server will be notified so that it can ensure any mediated device
is not using the now invalid BFN mapping. Once all BFN mappings are removed
by affected IOREQ servers (decrementing reference count each time) then the
MFN will be released back to Xen.
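A rough sketch of the reference counting described here, under the assumption that a page is released only once the guest has ballooned it out and every M2B reference has been dropped. The structure and function names are hypothetical and the notification itself is only indicated by a comment.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_mfn {
    unsigned int m2b_refs;      /* one reference per M2B entry */
    bool         ballooned_out; /* the owning guest gave up the GFN */
    bool         released;      /* page handed back to Xen */
};

/* Release the page only when nothing refers to it any more. */
static void sketch_try_release(struct sketch_mfn *m)
{
    if (m->ballooned_out && m->m2b_refs == 0)
        m->released = true;
}

/* A new M2B entry takes a reference on the backing MFN. */
void sketch_m2b_add(struct sketch_mfn *m)
{
    m->m2b_refs++;
}

/* The guest balloons the page out; in the design the affected IOREQ
 * servers would be notified at this point. */
void sketch_guest_balloon_out(struct sketch_mfn *m)
{
    m->ballooned_out = true;
    sketch_try_release(m);
}

/* An IOREQ server removes one BFN mapping, dropping one reference. */
void sketch_m2b_remove(struct sketch_mfn *m)
{
    if (m->m2b_refs > 0)
        m->m2b_refs--;
    sketch_try_release(m);
}
```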
> 
>> All the following conditions are required to be true for PV IOMMU 
>> map_foreign
>> subop to succeed:
>>
>> 1. IOMMU detected and supported by Xen
>> 2. The domain has IOMMU controlled hardware allocated to it
>> 3. The domain is a hardware_domain and the following Xen IOMMU options are
>>    NOT enabled: dom0-passthrough
> 
> Is there a way for the hardware domain to know it is running in
> pass-through mode? Also, "the domain" is ambiguous here; I'm
> sure you mean the invoking domain, not the one owning the page.
The IOMMU_QUERY_map_cap flag will not be set. I will change "the domain" to
"the calling domain".
> 
>> This subop usage of the "struct pv_iommu_op" and ``struct map_foreign_page`
>> fields are detailed below:
>>
>> --------------------------------------------------------------------
>> Field          Purpose
>> -----          -----------------------------------------------------
>> `domid`        [in] The domain ID for which the gfn field applies
>>
>> `ioserver`     [in] IOREQ server id associated with mapping
>>
>> `bfn`          [in] Bus address frame number for gfn address
> 
> In the description above you speak of returning data in this field. Is
> [in] really correct?
> 
The description is wrong: map_foreign_page will allow the bfn to be
specified by the calling domain.
>> Defined bits for flags field:
>>
>> Name                         Bit                Definition
>> ----                        -----      ----------------------------------
>> IOMMUOP_readable              0        BFN IOMMU mapping is readable
>> IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
>> IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
>>                                        swapped to scratch page
> 
> Scratch page? This needs some explanation.
Hopefully this is explained further down. It is expected that certain BFN mappings
are mapping data for the mediated device and not control information. By allowing
a scratch page to be used Xen can immediately swap the BFN mappings and remove the
references to the affected MFN and free the MFN back to Xen. This avoids the delay
for the ioreq servers to unmap the BFNs explicitly and hopefully speeds up ballooning.
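As a hedged illustration of the scratch page idea: mappings created with IOMMUOP_swap_mfn can be redirected at once, while the rest stay pending until their IOREQ server unmaps them. The scratch MFN value, the structure layout and the function name below are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SCRATCH_MFN 0xffffu   /* hypothetical scratch page */

struct sketch_bfn_map {
    uint64_t mfn;       /* MFN currently behind this BFN mapping */
    bool     swap_ok;   /* IOMMUOP_swap_mfn was set at map time */
};

/* On balloon-out of `mfn`, redirect every swappable mapping to the
 * scratch page. Returns the number of mappings that still reference
 * `mfn` and therefore need an explicit unmap by an IOREQ server. */
size_t sketch_swap_on_balloon_out(struct sketch_bfn_map *maps, size_t n,
                                  uint64_t mfn)
{
    size_t pending = 0;

    for (size_t i = 0; i < n; i++) {
        if (maps[i].mfn != mfn)
            continue;
        if (maps[i].swap_ok)
            maps[i].mfn = SKETCH_SCRATCH_MFN;
        else
            pending++;
    }
    return pending;
}
```

When the function returns 0, nothing refers to the ballooned page any longer and it could be freed without waiting for any IOREQ server.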
> 
>> Reserved for future use      3-9       Reserved flag bits should be 0
>> IOMMU_page_order            10-15      Returns maximum possible page order for
>>                                        all other IOMMUOP subops
> 
> "Returns"? "other"?
Copy paste error, this should be the input page order not the returned page order.
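For reference, the flag layout from the table (bits 0-2 for permissions and swap, bits 3-9 reserved, bits 10-15 for the page order) could be packed as sketched below. The bit positions follow the draft; the 32-bit width and the `SKETCH_*`/`sketch_*` names are assumptions.

```c
#include <stdint.h>

/* Bit positions taken from the draft's map_foreign_page flags table. */
#define IOMMUOP_readable         (1u << 0)
#define IOMMUOP_writeable        (1u << 1)
#define IOMMUOP_swap_mfn         (1u << 2)

/* Helper masks, hypothetical names. */
#define SKETCH_RESERVED_MASK     (0x7fu << 3)    /* bits 3-9, must be 0 */
#define SKETCH_PAGE_ORDER_SHIFT  10
#define SKETCH_PAGE_ORDER_MASK   (0x3fu << SKETCH_PAGE_ORDER_SHIFT) /* bits 10-15 */

/* Build an input flags word from permission bits and a page order. */
uint32_t sketch_make_flags(uint32_t perms, unsigned int order)
{
    uint32_t low = perms & (IOMMUOP_readable | IOMMUOP_writeable |
                            IOMMUOP_swap_mfn);
    uint32_t ord = ((uint32_t)order << SKETCH_PAGE_ORDER_SHIFT) &
                   SKETCH_PAGE_ORDER_MASK;

    return low | ord;
}

/* Extract the page order from a flags word. */
unsigned int sketch_page_order(uint32_t flags)
{
    return (flags & SKETCH_PAGE_ORDER_MASK) >> SKETCH_PAGE_ORDER_SHIFT;
}
```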
> 
>> IOMMU_lookup_foreign_page
>> ----------------
>> This subop uses `struct lookup_foreign_page` part of the `struct 
>> pv_iommu_op`.
>>
>> If the BFN is specified as an input and parameter and there is no IOMMU 
>> support
>> for the calling domain then an error will be returned.
> 
> "input and parameter"? The field is marked [out] below, and I also
> can't see a flag allowing this to be optional (the more that flags is
> also [out]).
Another mistake by me. This is old text from a draft where map and lookup were
the same subop.
The BFN is always an output parameter for the lookup_foreign_page subop.
> 
>> It is the calling domain responsibility to ensure there are no conflicts
> 
> No races you mean?
Races are also the calling domain's responsibility, but I'm referring to BFN
address space conflicts here.
> 
>> Each successful subop will add to the M2B if there was not an existing 
>> identical
>> M2B entry.
>>
>> Every new M2B entry will take a reference to the MFN backing the GFN.
> 
> This is a lookup - why would it add something somewhere? Perhaps
> this is just a copy-and-paste mistake? Or is the use of "lookup" here
> misleading (as the last section of the document seems to suggest)?
I can see how the term "lookup" is misleading. This subop really does a
lookup and takes a reference. Can you suggest an alternative name? I'm wary of
allowing
> 
>> Defined bits for flags field:
>>
>> Name                         Bit                Definition
>> ----                        -----      ----------------------------------
>> IOMMUOP_readable              0        Returned BFN IOMMU mapping is 
>> readable
>> IOMMUOP_writeable             1        Returned BFN IOMMU mapping is 
>> writeable
>> Reserved for future use      2-9       Reserved flag bits should be 0
> 
> Reserved flags will be returned as 0.
Agreed, I will make the change.
> 
>> Defined values for lookup_foreign_page subop status field:
>>
>> Error code  Reason
>> ----------  ------------------------------------------------------------
>> 0            subop successfully returned
>> -EPERM       Calling domain does not have sufficient privilege over domid
>> -ENOENT      There is no available BFN for provided GFN + domid combination
> 
> Throughout here there seems to be a mixture of references to
> (GFN, domid) pairs anmd (GFN,domid,ioserver) triplets. Along
> with there being some explanation missing, these need to be
> made consistent to avoid confusion.
I'll adjust them to all be (GFN,domid,ioserver) triples.
A particular BFN will map to a (GFN, domid) set so a (GFN,domid,ioserver) triple
can be determined via a (BFN, ioserver).
> 
>> IOMMUOP_unmap_foreign_page
>> ----------------
>> This subop uses `struct unmap_foreign_page` part of the `struct 
>> pv_iommu_op`.
>>
>> If there is no IOMMU support then the MFN is returned in the BFN field (that 
>> is
>> the only valid bus address for the GFN + domid combination).
> 
> Copy-and-paste mistake again (doesn't seem to apply to unmap)?
Yes, it's a copy paste mistake.
> 
>> If there is IOMMU support then the specified BFN is returned for the GFN + 
>> domid
>> combination
>>
>> Each successful subop will add to the M2B if there was not an existing 
>> identical
>> M2B entry. The
>>
>> Every new M2B entry will take a reference to the MFN backing the GFN.
> 
> And again?
Yes, it's a copy paste mistake.
> 
>> This subop usage of the "struct pv_iommu_op" and ``struct 
>> unmap_foreign_page` fields
>> are detailed below:
>>
>> -----------------------------------------------------------------------
>> Field          Purpose
>> -----          --------------------------------------------------------
>> `ioserver`      [in] IOREQ server id associated with mapping
>>
>> `bfn`          [in] Bus address frame number for gfn address
>>
>> `flags`        [out] Flags for signalling page order of unmap operation
> 
> [out]?
Yes, it's a copy paste mistake. It should be [in]
> 
>> Error code  Reason
>> ----------  ------------------------------------------------------------
>> 0            subop successfully returned
>> -ENOENT      There is no mapped BFN + ioserver id combination to unmap
> 
> Yet another pair, the unique mapping properties of which don't
> follow from anything said earlier.
A particular BFN will map to a (GFN, domid) set so a (GFN,domid,ioserver) triple
will be determined via a (BFN, ioserver) pair.
The ioserver may have lost track of the (GFN, domid) pair used for a particular
BFN mapping but may have just been notified of its BFN mapping becoming invalid.
So it's more efficient to unmap based on BFN and ioserver ID.
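A sketch of why the (BFN, ioserver) pair is sufficient as an unmap key: the entry itself records the (GFN, domid) it was created for, so the caller does not need to remember them. The table layout, field widths and function name are hypothetical; only -ENOENT is taken from the status table.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_foreign_map {
    bool     in_use;
    uint64_t bfn;       /* key, together with ioserver */
    uint16_t ioserver;
    uint64_t gfn;       /* recorded at map time */
    uint16_t domid;     /* recorded at map time */
};

/* Remove the entry matching (bfn, ioserver). The (gfn, domid) of the
 * removed entry is reported back through the optional out pointers. */
int sketch_unmap_foreign(struct sketch_foreign_map *tbl, size_t n,
                         uint64_t bfn, uint16_t ioserver,
                         uint64_t *gfn_out, uint16_t *domid_out)
{
    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].in_use || tbl[i].bfn != bfn ||
            tbl[i].ioserver != ioserver)
            continue;
        if (gfn_out)
            *gfn_out = tbl[i].gfn;
        if (domid_out)
            *domid_out = tbl[i].domid;
        tbl[i].in_use = false;
        return 0;
    }
    return -ENOENT;   /* no mapped BFN + ioserver id combination */
}
```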
> 
>> PV IOMMU interactions with self ballooning
>> ==========================================
>>
>> The guest should clear any IOMMU mappings it has of it's own pages before
>> releasing a page back to Xen. It will need to add IOMMU mappings after
>> repopulating a page with the populate_physmap hypercall.
>>
>> This requires that IOMMU mappings get a writeable page type reference count 
>> and
>> that guests clear any IOMMU mappings before pinning page table pages.
> 
> I suppose this is only for aware PV guests. If so, perhaps this should
> be made explicit.
This is only for PV guests. I will make the correction.
> 
>> PV IOMMU interactions with grant map/unmap operations
>> =====================================================
>>
>> Grant map operations return a Physical device accessible address (BFN) if 
>> the
>> GNTMAP_device_map flag is set.  This operation currently returns the MFN for 
>> PV
>> guests which may conflict with the BFN address space the guest uses if PV 
>> IOMMU
>> map support is available to the guest.
>>
>> This design proposes to allow the calling domain to control the BFN address 
>> that
>> a grant map operation uses.
>>
>> This can be achieved by specifying that the dev_bus_addr in the
>> gnttab_map_grant_ref structure is used an input parameter instead of the
>> output parameter it is currently.
>>
>> Only PAGE_SIZE aligned addresses are allowed for dev_bus_addr input 
>> parameter.
>>
>> The revised structure is shown below for convenience.
>>
>>     struct gnttab_map_grant_ref {
>>         /* IN parameters. */
>>         uint64_t host_addr;
>>         uint32_t flags;               /* GNTMAP_* */
>>         grant_ref_t ref;
>>         domid_t  dom;
>>         /* OUT parameters. */
>>         int16_t  status;              /* => enum grant_status */
>>         grant_handle_t handle;
>>         /* IN/OUT parameters */
>>         uint64_t dev_bus_addr;
>>     };
>>
>>
>> The grant map operation would then behave similarly to the IOMMUOP_map_page
>> subop for the creation of the IOMMU mapping.
>>
>> The grant unmap operation would then behave similarly to the 
>> IOMMUOP_unmap_page
>> subop for the removal of the IOMMU mapping.
> 
> We're talking about mappings of foreign pages here - aren't these the
> wrong IOMMUOPs then? And if so, where would the ioserver id come
> from?
> 
I don't expect grant mapped pages to be ballooned out or to be directly used by
ioservers so I believe the grant mapped pages match more closely to the standard
map_page than the foreign_map_page.
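If dev_bus_addr becomes an input as proposed, the PAGE_SIZE alignment rule from the draft could be checked roughly as follows. The 4096-byte page size and the helper names are assumptions of this sketch, not part of the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* Only PAGE_SIZE aligned bus addresses would be accepted as input. */
bool sketch_dev_bus_addr_ok(uint64_t dev_bus_addr)
{
    return (dev_bus_addr & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/* Convert an aligned bus address into the BFN it designates. */
uint64_t sketch_dev_bus_addr_to_bfn(uint64_t dev_bus_addr)
{
    return dev_bus_addr / SKETCH_PAGE_SIZE;
}
```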
Generally I think I need to rework the document to introduce some concepts
before the actual interface itself.
Thank you for the comments and sorry for the document having quite a few typos and
copy paste errors. It's been through quite a few internal updates.
I think I should add a section to explain that there are two types of BFN address
space users.
The first type (map_page) is simpler and just allows the calling domain to
control the BFN mappings for GFNs it owns and for any grant mapped pages of other
domains. There is no need to deal with other domains ballooning their pages with
this type of mapping.
The second type (map_foreign_page) is more complex and allows the calling domain to
control the BFN mappings for GFNs of other domains that the calling domain has privilege
over. This type of mapping is designed for mediated passthrough ioservers. It has to
deal with other domains ballooning their pages.
The interface is designed so that you cannot "remap" an existing BFN mapping. This
allows the different types of BFN mappings to be tracked separately by Xen. There
is no need for the M2B for the first type of mapping.
Thanks for the comments. I can see how busy it is on the list and I wasn't expecting a
response this quickly :)
Malcolm
> Jan
> 
- * Re: [RFC] Xen PV IOMMU interface draft B
  2015-06-16 14:47   ` Malcolm Crossley
@ 2015-06-16 15:56     ` Jan Beulich
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Beulich @ 2015-06-16 15:56 UTC
  To: Malcolm Crossley
  Cc: Kevin Tian, Yu C Zhang, Andrew Cooper, Paul Durrant, David Vrabel,
	xen-devel, Zhiyuan Lv
>>> On 16.06.15 at 16:47, <malcolm.crossley@citrix.com> wrote:
> On 16/06/15 14:19, Jan Beulich wrote:
>>>>> On 12.06.15 at 18:43, <malcolm.crossley@citrix.com> wrote:
>>> IOMMU_QUERY_map_all_gfns     1        IOMMUOP_map_page subop can map any MFN
>>>                                       not used by Xen
>> 
>> "gfns" or "MFN"?
> 
> gfns . This is meant to apply to the hardware domain only, it's to allow the
> same access control as dom0-relaxed mode allows currently.
But why "gfns" in the name and "any MFN" in the description?
>>> IOMMUOP_unmap_page
>>> ------------------
>>> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
>>>
>>> The subop usage of the "struct pv_iommu_op" and ``struct unmap_page` fields
>>> are detailed below:
>>>
>>> --------------------------------------------------------------------
>>> Field          Purpose
>>> -----          -----------------------------------------------------
>>> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
>> 
>> Has it been determined that unmapping based on GFN is never
>> going to be needed, and that unmapping by BFN is the more
>> practical solution? The map counterpart doesn't seem to exclude
>> establishing multiple mappings for the same BFN, and hence the
>> inverse here would become kind of fuzzy in that case.
> 
> There will be only one BFN to MFN mapping per domain, the map hypercall will
> fail any attempt to have map a BFN to more than one GFN. This is why the 
> unmap
> is based on the BFN. It is allowed to have multiple BFN mappings of the same
> GFN however.
Okay, I got confused again by the term BFN - I keep mixing up the
parts of the bus between device and IOMMU vs between IOMMU
and RAM. Alternatives I could think of (DFN for Device Frame Number)
wouldn't be any better, so I guess we need to live with the ambiguity.
>>> Each successful subop will add to the M2B if there was not an existing 
>>> identical
>>> M2B entry.
>>>
>>> Every new M2B entry will take a reference to the MFN backing the GFN.
>> 
>> This is a lookup - why would it add something somewhere? Perhaps
>> this is just a copy-and-paste mistake? Or is the use of "lookup" here
>> misleading (as the last section of the document seems to suggest)?
> 
> I can see how the term "lookup" is misleading. This subop really does a
> lookup and take reference. Can you suggest an alternative name?
get_foreign_page_map? In any event, in particular with a possibly
ambiguous name, the description should be particularly clear and
obvious to help the reader (which also applies to the code to be
written).
> I'm wary of allowing
???
>>> PV IOMMU interactions with self ballooning
>>> ==========================================
>>>
>>> The guest should clear any IOMMU mappings it has of it's own pages before
>>> releasing a page back to Xen. It will need to add IOMMU mappings after
>>> repopulating a page with the populate_physmap hypercall.
>>>
>>> This requires that IOMMU mappings get a writeable page type reference count 
>>> and
>>> that guests clear any IOMMU mappings before pinning page table pages.
>> 
>> I suppose this is only for aware PV guests. If so, perhaps this should
>> be made explicit.
> 
> This is only for PV guests. I will make the correction.
The emphasis was on "aware", not on "PV" (which is already stated).
>>> The grant map operation would then behave similarly to the IOMMUOP_map_page
>>> subop for the creation of the IOMMU mapping.
>>>
>>> The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
>>> subop for the removal of the IOMMU mapping.
>> 
>> We're talking about mappings of foreign pages here - aren't these the
>> wrong IOMMUOPs then? And if so, where would the ioserver id come
>> from?
>> 
> 
> I don't expect grant mapped pages to be ballooned out or to be directly used by
> ioservers so I believe the grant mapped pages match more closely to the standard
> map_page than the foreign_map_page.
Right, that became clear with you saying that the ioserver id is
meant to be used for balloon out notifications only. But that
should be made explicit.
> Generally I think I need to rework the document to introduce some concepts
> before the actual interface itself.
Yes, that would be very helpful. The interface spec should probably
be the (almost) last thing.
Jan
- * Re: [RFC] Xen PV IOMMU interface draft B
  2015-06-12 16:43 [RFC] Xen PV IOMMU interface draft B Malcolm Crossley
  2015-06-16 13:19 ` Jan Beulich
@ 2015-06-17 12:48 ` Yu, Zhang
  2015-06-17 13:34   ` Jan Beulich
  2015-06-17 13:44   ` Malcolm Crossley
  2015-06-26 10:23 ` Xen PV IOMMU interface draft C Malcolm Crossley
  2 siblings, 2 replies; 20+ messages in thread
From: Yu, Zhang @ 2015-06-17 12:48 UTC (permalink / raw)
  To: Malcolm Crossley, xen-devel, Jan Beulich, Konrad Rzeszutek Wilk,
	Andrew Cooper, Paul Durrant, Kevin Tian, Lv, Zhiyuan,
	David Vrabel
Hi Malcolm,
   Thank you very much for accommodating our XenGT requirement in your
design. Following are some XenGT related questions. :)
On 6/13/2015 12:43 AM, Malcolm Crossley wrote:
> Hi All,
>
> Here is a design for allowing guests to control the IOMMU. This
> allows for the guest GFN mapping to be programmed into the IOMMU and
> avoid using the SWIOTLB bounce buffer technique in the Linux kernel
> (except for legacy 32 bit DMA IO devices).
>
> Draft B has been expanded to include Bus Address mapping/lookup for Mediated
> pass-through emulators.
>
> The pandoc markdown format of the document is provided below to allow
> for easier inline comments:
>
> % Xen PV IOMMU interface
> % Malcolm Crossley <<malcolm.crossley@citrix.com>>
>    Paul Durrant <<paul.durrant@citrix.com>>
> % Draft B
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft A  10 Apr 2014  Initial draft.
>
> Draft B  12 Jun 2015  Second draft.
> --------------------------------------------------------------------
>
> Background
> ==========
>
> Linux kernel SWIOTLB
> --------------------
>
> Xen PV guests use a Pseudophysical Frame Number(PFN) address space which is
> decoupled from the host Machine Frame Number(MFN) address space.
>
> PV guest hardware drivers are aware of the PFN address space only and
> assume that if PFN addresses are contiguous then the hardware addresses would
> be contiguous as well. The decoupling between PFN and MFN address spaces means
> PFN and MFN addresses may not be contiguous across page boundaries and thus a
> buffer allocated in GFN address space which spans a page boundary may not be
> contiguous in MFN address space.
>
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
>
> A bounce buffer region is a special part of the PFN address space which has
> been made to be contiguous in both PFN and MFN address spaces. When a driver
> requests a buffer which spans a page boundary be made available for hardware
> to read, the core operating system code copies the buffer into a temporarily
> reserved part of the bounce buffer region and then returns the MFN address of
> the reserved part of the bounce buffer region back to the driver itself. The
> driver then instructs the hardware to read the copy of the buffer in the
> bounce buffer. Similarly, if the driver requests that a buffer be made available
> for hardware to write to, a region of the bounce buffer is reserved first
> and, after the hardware completes writing, the reserved region of the
> bounce buffer is copied to the originally allocated buffer.
>
> The overhead of memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk the fixed size
> bounce buffer region will become exhausted and it will not be possible to
> return a hardware address back to the driver. The Linux kernel drivers do not
> tolerate this failure and so the kernel is forced to crash, as an
> uncorrectable error has occurred.
>
> Input/Output Memory Management Units (IOMMU) allow for an inbound address
> mapping to be created from the I/O Bus address space (typically PCI) to
> the machine frame number address space. IOMMUs typically use a page table
> mechanism to manage the mappings and therefore can create mappings of page size
> granularity or larger.
>
> The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
> address space for the rest of this document.
>
>
> Mediated Pass-through Emulators
> -------------------------------
>
> Mediated Pass-through emulators allow guest domains to interact with
> hardware devices via emulator mediation. The emulator runs in a domain separate
> to the guest domain and it is used to enforce security of guest access to the
> hardware devices and isolation of different guests accessing the same hardware
> device.
>
> The emulator requires a mechanism to map guest addresses to a bus address that
> the hardware devices can access.
>
>
> Clarification of GFN and BFN fields for different guest types
> -------------------------------------------------------------
> Guest Frame Numbers (GFN) definition varies depending on the guest type.
>
> Diagram below details the memory accesses originating from CPU, per guest type:
>
>        HVM guest                              PV guest
>
>           (VA)                                   (VA)
>            |                                      |
>           MMU                                    MMU
>            |                                      |
>           (GFN)                                   |
>            |                                      | (GFN)
>       HAP a.k.a EPT/NPT                           |
>            |                                      |
>           (MFN)                                  (MFN)
>            |                                      |
>           RAM                                    RAM
>
> For PV guests GFN is equal to MFN for a single page but not for a contiguous
> range of pages.
>
> Bus Frame Numbers (BFN) refer to the address presented on the physical bus
> before being translated by the IOMMU.
>
> Diagram below details memory accesses originating from physical device.
>
>      Physical Device
>            |
>          (BFN)
>            |
>         IOMMU-PT
>            |
>          (MFN)
>            |
>           RAM
>
>
>
> Purpose
> =======
>
> 1. Allow Xen guests to create/modify/destroy IOMMU mappings for
> hardware devices that the PV guest has access to. This enables the PV guest to
> program a bus address space mapping which matches its GFN mapping. Once a 1-1
> mapping of PFN to bus address space is created then a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
>
> 2. Allow for Xen guests to lookup/create/modify/destroy IOMMU mappings for
> guest memory of domains the calling Xen guest has sufficient privilege over.
> This enables domains to provide mediated hardware acceleration to other
> guest domains.
>
>
> Xen Architecture
> ================
>
> The Xen architecture consists of a new hypercall interface and changes to the
> grant map interface.
>
> The existing IOMMU mappings setup at domain creation time will be preserved so
> that PV domains unaware of this feature will continue to function with no
> changes required.
>
> Memory ballooning will be supported by taking an additional reference on the
> MFN backing the GFN for each successful IOMMU mapping created.
>
> An M2B tracking structure will be used to ensure all references to an MFN can
> be located easily.
>
> Xen PV IOMMU hypercall interface
> --------------------------------
> A two argument hypercall interface (do_iommu_op).
>
> ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
>
> First argument, guest handle pointer to array of `struct pv_iommu_op`
> Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
>
> Definition of struct pv_iommu_op:
>
>      struct pv_iommu_op {
>
>          uint16_t subop_id;
>          uint16_t flags;
>          int32_t status;
>
>          union {
>              struct {
>                  uint64_t bfn;
>                  uint64_t gfn;
>              } map_page;
>
>              struct {
>                  uint64_t bfn;
>              } unmap_page;
>
>              struct {
>                  uint64_t bfn;
>                  uint64_t gfn;
>                  uint16_t domid;
>                  ioservid_t ioserver;
>              } map_foreign_page;
>
>              struct {
>                  uint64_t bfn;
>                  uint64_t gfn;
>                  uint16_t domid;
>                  ioservid_t ioserver;
>              } lookup_foreign_page;
>
>              struct {
>                  uint64_t bfn;
>                  ioservid_t ioserver;
>              } unmap_foreign_page;
>          } u;
>      };
>
> Definition of PV IOMMU subops:
>
>      #define IOMMUOP_query_caps            1
>      #define IOMMUOP_map_page              2
>      #define IOMMUOP_unmap_page            3
>      #define IOMMUOP_map_foreign_page      4
>      #define IOMMUOP_lookup_foreign_page   5
>      #define IOMMUOP_unmap_foreign_page    6
>
>
> Design considerations for hypercall op
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the IOMMU TLB
> to ensure the IO device uses the updated mappings.
>
> The op has been designed to take an array of operations and a count as
> parameters. This allows for easily implemented hypercall continuations to be
> used and allows for batches of IOMMU operations to be submitted before flushing
> the IOMMU TLB.
>
> The subop_id to be used for a particular element is encoded into the element
> itself. This allows for map and unmap operations to be performed in one hypercall
> and for the IOMMU TLB flushing optimisations to be still applied.
>
> The hypercall will ensure that the required IOMMU TLB flushes are applied before
> returning to guest via either hypercall completion or a hypercall continuation.
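To illustrate the batching design described above, here is a minimal sketch of how a caller might fill one array with mixed map and unmap elements before a single do_iommu_op call. The struct is a reduced copy of the draft's `struct pv_iommu_op`; the helpers `fill_map`, `fill_unmap` and `count_failed` are hypothetical names, and the hypercall itself is not shown.

```c
#include <stdint.h>
#include <string.h>

/* Subop numbers and flag bits as listed in the draft. */
#define IOMMUOP_map_page    2
#define IOMMUOP_unmap_page  3
#define IOMMU_OP_readable   (1u << 0)
#define IOMMU_OP_writeable  (1u << 1)

/* Reduced copy of the draft's struct: only the two members used here. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t  status;
    union {
        struct { uint64_t bfn; uint64_t gfn; } map_page;
        struct { uint64_t bfn; } unmap_page;
    } u;
};

/* Hypothetical helper: fill one element with a read/write map request. */
static void fill_map(struct pv_iommu_op *op, uint64_t bfn, uint64_t gfn)
{
    memset(op, 0, sizeof(*op));
    op->subop_id = IOMMUOP_map_page;
    op->flags = IOMMU_OP_readable | IOMMU_OP_writeable;
    op->u.map_page.bfn = bfn;
    op->u.map_page.gfn = gfn;
}

/* Hypothetical helper: fill one element with an unmap request. */
static void fill_unmap(struct pv_iommu_op *op, uint64_t bfn)
{
    memset(op, 0, sizeof(*op));
    op->subop_id = IOMMUOP_unmap_page;
    op->u.unmap_page.bfn = bfn;
}

/* Hypothetical helper: after the hypercall returns, each element carries
 * its own status, so the caller counts the non-zero ones. */
static unsigned int count_failed(const struct pv_iommu_op *ops,
                                 unsigned int count)
{
    unsigned int i, failed = 0;

    for (i = 0; i < count; i++)
        if (ops[i].status != 0)
            failed++;
    return failed;
}
```

Because the subop_id lives in each element, one array can carry both kinds of request, which is what lets the TLB flush be deferred to the end of the batch.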
>
> IOMMUOP_query_caps
> ------------------
>
> This subop queries the runtime capabilities of the PV-IOMMU interface for the
> specific called domain. This subop uses `struct pv_iommu_op` directly.
>
> ------------------------------------------------------------------------------
> Field          Purpose
> -----          ---------------------------------------------------------------
> `flags`        [out] This field details the IOMMUOP capabilities.
>
> `status`       [out] Status of this op, op specific values listed below
> ------------------------------------------------------------------------------
>
> Defined bits for flags field:
>
> ------------------------------------------------------------------------------
> Name                        Bit                Definition
> ----                       ------     ----------------------------------
> IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or IOMMUOP_map_foreign
>                                        can be used for this domain
>
> IOMMU_QUERY_map_all_gfns     1        IOMMUOP_map_page subop can map any MFN
>                                        not used by Xen
>
> Reserved for future use     2-9                   n/a
>
> IOMMU_page_order           10-15      Returns maximum possible page order for
>                                        all other IOMMUOP subops
> ------------------------------------------------------------------------------
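As a reading aid for the table above, a small sketch of decoding the query_caps flags word. The two capability bits follow the draft; the `PV_IOMMU_ORDER_*` macro names and both helper functions are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability bits as listed in the draft. */
#define IOMMU_QUERY_map_cap       (1u << 0)
#define IOMMU_QUERY_map_all_gfns  (1u << 1)

/* Hypothetical names for the page order field held in bits 10-15. */
#define PV_IOMMU_ORDER_SHIFT  10
#define PV_IOMMU_ORDER_MAX    63u

/* True if the map subops can be used for the calling domain. */
static bool caps_can_map(uint16_t flags)
{
    return (flags & IOMMU_QUERY_map_cap) != 0;
}

/* Extract the six-bit maximum page order from bits 10-15. */
static unsigned int caps_max_page_order(uint16_t flags)
{
    return (flags >> PV_IOMMU_ORDER_SHIFT) & PV_IOMMU_ORDER_MAX;
}
```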
>
> Defined values for query_caps subop status field:
>
> Value   Reason
> ------  ----------------------------------------------------------
> 0       subop successfully returned
>
> IOMMUOP_map_page
> ----------------------
> This subop uses `struct map_page` part of the `struct pv_iommu_op`.
>
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFN's except for Xen owned MFN's else the hardware
> domain will only be allowed to map GFN's which it owns.
>
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFN's without taking a reference to the MFN backing the GFN
> by setting the IOMMU_MAP_OP_no_ref_cnt flag.
>
> Every successful pv_iommu_op will result in an additional page reference being
> taken on the MFN backing the GFN except for the condition detailed above.
>
> If the map_op flags indicate a writeable mapping is required then a writeable
> page type reference will be taken otherwise a standard page reference will be
> taken.
>
> All the following conditions are required to be true for PV IOMMU map
> subop to succeed:
>
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. If hardware_domain and the following Xen IOMMU options are
>     NOT enabled: dom0-passthrough
>
> This subop usage of the `struct pv_iommu_op` and `struct map_page` fields
> are detailed below:
>
> ------------------------------------------------------------------------------
> Field          Purpose
> -----          ---------------------------------------------------------------
> `bfn`          [in]  Bus address frame number(BFN) to be mapped to specified gfn
>                       below
>
> `gfn`          [in]  Guest address frame number for DOMID_SELF
>
> `flags`        [in]  Flags for signalling type of IOMMU mapping to be created,
>                       Flags can be combined.
>
> `status`       [out] Mapping status of this op, op specific values listed below
> ------------------------------------------------------------------------------
>
> Defined bits for flags field:
>
> Name                        Bit                Definition
> ----                       -----      ----------------------------------
> IOMMU_OP_readable            0        Create readable IOMMU mapping
> IOMMU_OP_writeable           1        Create writeable IOMMU mapping
> IOMMU_MAP_OP_no_ref_cnt      2        IOMMU mapping does not take a reference to
>                                        MFN backing BFN mapping
> Reserved for future use     3-9                   n/a
> IOMMU_page_order            10-15     Page order to be used for both gfn and bfn
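A possible way to build the map_page flags word from the bits listed above, rejecting a page order that does not fit in the six-bit field. The access bits follow the draft; the `PV_IOMMU_ORDER_*` names and `map_page_flags` are hypothetical.

```c
#include <stdint.h>

/* Flag bits as listed in the draft. */
#define IOMMU_OP_readable        (1u << 0)
#define IOMMU_OP_writeable       (1u << 1)
#define IOMMU_MAP_OP_no_ref_cnt  (1u << 2)

/* Hypothetical names for the page order field held in bits 10-15. */
#define PV_IOMMU_ORDER_SHIFT  10
#define PV_IOMMU_ORDER_MAX    63u

/* Returns 0 and stores the encoded flags on success; returns -1 if the
 * order does not fit in six bits or if bits outside 0-2 are requested. */
static int map_page_flags(unsigned int access, unsigned int order,
                          uint16_t *out)
{
    if (order > PV_IOMMU_ORDER_MAX || (access & ~0x7u))
        return -1;
    *out = (uint16_t)(access | (order << PV_IOMMU_ORDER_SHIFT));
    return 0;
}
```

Checking the reserved bits on the caller side is a design choice of this sketch; the draft only says they are reserved for future use.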
>
> Defined values for map_page subop status field:
>
> Value   Reason
> ------  ----------------------------------------------------------------------
> 0       subop successfully returned
> -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
> -EPERM  GFN could not be mapped because the GFN belongs to Xen.
> -EPERM  Domain is not a hardware domain and GFN does not belong to domain
> -EPERM  Domain is a hardware domain, IOMMU dom0-strict mode is enabled and
>          GFN does not belong to domain
> -EACCES BFN address conflicts with RMRR regions for devices attached to
>          DOMID_SELF
> -ENOSPC Page order is too large for either BFN, GFN or IOMMU unit
>
> IOMMUOP_unmap_page
> ------------------
> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
>
> The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
> are detailed below:
>
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
>
> `flags`        [in] Flags for signalling page order of unmap operation
>
> `status`       [out] Mapping status of this unmap operation, 0 indicates success
> --------------------------------------------------------------------
>
> Defined bits for flags field:
>
> Name                        Bit                Definition
> ----                       -----      ----------------------------------
> Reserved for future use     0-9                   n/a
> IOMMU_page_order            10-15     Page order to be used for bfn
>
>
> Defined values for unmap_page subop status field:
>
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EIO         IOMMU unit returned error when attempting to unmap BFN.
> -ENOSPC      Page order is too large for either BFN address or IOMMU unit
> ------------------------------------------------------------------------
>
>
> IOMMUOP_map_foreign_page
> ----------------
> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
>
> It is not valid to use domid representing the calling domain.
>
> The hypercall will only succeed if calling domain has sufficient privilege over
> the specified domid
>
> If there is no IOMMU support then the MFN is returned in the BFN field (that is
> the only valid bus address for the GFN + domid combination).
>
> If there is IOMMU support then the specified BFN is returned for the GFN + domid
> combination
>
> The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.
>
> Each successful subop will add to the M2B if there was not an existing identical
> M2B entry.
>
> Every new M2B entry will take a reference to the MFN backing the GFN.
>
> All the following conditions are required to be true for PV IOMMU map_foreign
> subop to succeed:
>
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. The domain is a hardware_domain and the following Xen IOMMU options are
>     NOT enabled: dom0-passthrough
What if the IOMMU is enabled and runs in the default mode, which 1:1
maps all memory except that owned by Xen?
>
>
> This subop usage of the `struct pv_iommu_op` and `struct map_foreign_page`
> fields are detailed below:
>
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `domid`        [in] The domain ID for which the gfn field applies
>
> `ioserver`     [in] IOREQ server id associated with mapping
>
> `bfn`          [in] Bus address frame number for gfn address
>
> `gfn`          [in] Guest address frame number
>
> `flags`        [in] Details the status of the BFN mapping
>
> `status`       [out] status of this subop, 0 indicates success
> --------------------------------------------------------------------
>
> Defined bits for flags field:
>
> Name                         Bit                Definition
> ----                        -----      ----------------------------------
> IOMMUOP_readable              0        BFN IOMMU mapping is readable
> IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
> IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
>                                         swapped to scratch page
> Reserved for future use      3-9       Reserved flag bits should be 0
> IOMMU_page_order            10-15      Returns maximum possible page order for
>                                         all other IOMMUOP subops
>
> Defined values for map_foreign_page subop status field:
>
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EIO         IOMMU unit returned error when attempting to map BFN to GFN.
> -EPERM       Calling domain does not have sufficient privilege over domid
> -EPERM       GFN could not be mapped because the GFN belongs to Xen.
> -EPERM       domid maps to DOMID_SELF
> -EACCES      BFN address conflicts with RMRR regions for devices attached to
>               DOMID_SELF
> -ENODEV      Provided ioserver id is not valid
> -ENXIO       Provided domid id is not valid
> -ENXIO       Provided GFN address is not valid
> -ENOSPC      Page order is too large for either BFN, GFN or IOMMU unit
>
> IOMMUOP_lookup_foreign_page
> ----------------
> This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
>
> If the BFN is specified as an input parameter and there is no IOMMU support
> for the calling domain then an error will be returned.
>
> It is the calling domain's responsibility to ensure there are no conflicts
>
> The hypercall will only succeed if calling domain has sufficient privilege over
> the specified domid
>
> If there is no IOMMU support then the MFN is returned in the BFN field (that is
> the only valid bus address for the GFN + domid combination).
Similarly, what if the IOMMU is enabled and runs in the default mode,
which 1:1 maps all memory except that owned by Xen? Will an MFN be returned?
Or should we use the query/map ops instead of the lookup op in this
situation?
>
> Each successful subop will add to the M2B if there was not an existing identical
> M2B entry.
>
> Every new M2B entry will take a reference to the MFN backing the GFN.
>
> This subop usage of the `struct pv_iommu_op` and `struct lookup_foreign_page`
> fields are detailed below:
>
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `domid`        [in] The domain ID for which the gfn field applies
>
> `ioserver`     [in] IOREQ server id associated with mapping
>
> `bfn`          [out] Bus address frame number for gfn address
>
> `gfn`          [in] Guest address frame number
>
> `flags`        [out] Details the status of the BFN mapping
>
> `status`       [out] status of this subop, 0 indicates success
> --------------------------------------------------------------------
>
> Defined bits for flags field:
>
> Name                         Bit                Definition
> ----                        -----      ----------------------------------
> IOMMUOP_readable              0        Returned BFN IOMMU mapping is readable
> IOMMUOP_writeable             1        Returned BFN IOMMU mapping is writeable
> Reserved for future use      2-9       Reserved flag bits should be 0
> IOMMU_page_order            10-15      Returns maximum possible page order for
>                                         all other IOMMUOP subops
>
> Defined values for lookup_foreign_page subop status field:
>
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EPERM       Calling domain does not have sufficient privilege over domid
> -ENOENT      There is no available BFN for provided GFN + domid combination
> -ENODEV      Provided ioserver id is not valid
> -ENXIO       Provided domid id is not valid
> -ENXIO       Provided GFN address is not valid
>
>
> IOMMUOP_unmap_foreign_page
> ----------------
> This subop uses `struct unmap_foreign_page` part of the `struct pv_iommu_op`.
>
> If there is no IOMMU support then the MFN is returned in the BFN field (that is
> the only valid bus address for the GFN + domid combination).
>
> If there is IOMMU support then the specified BFN is returned for the GFN + domid
> combination
>
> Each successful subop will add to the M2B if there was not an existing identical
> M2B entry. The
>
> Every new M2B entry will take a reference to the MFN backing the GFN.
>
> This subop usage of the `struct pv_iommu_op` and `struct unmap_foreign_page` fields
> are detailed below:
>
> -----------------------------------------------------------------------
> Field          Purpose
> -----          --------------------------------------------------------
> `ioserver`      [in] IOREQ server id associated with mapping
>
> `bfn`          [in] Bus address frame number for gfn address
>
> `flags`        [out] Flags for signalling page order of unmap operation
>
> `status`       [out] status of this subop, 0 indicates success
> -----------------------------------------------------------------------
>
> Defined bits for flags field:
>
> Name                        Bit                Definition
> ----                        -----      ----------------------------------
> Reserved for future use     0-9                   n/a
> IOMMU_page_order            10-15     Page order to be used for bfn
>
> Defined values for unmap_foreign_page subop status field:
>
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -ENOENT      There is no mapped BFN + ioserver id combination to unmap
>
>
> IOMMUOP_*_foreign_page interactions with guest domain ballooning
> ================================================================
>
> Guest domains can balloon out a set of GFN mappings at any time and render the
> BFN to GFN mapping invalid.
>
> When a BFN to GFN mapping becomes invalid, Xen will issue a buffered IO request
> of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
> BFN address in the data field. If the buffered IO request ring is full then a
> standard (synchronous) IO request of type IOREQ_TYPE_INVALIDATE will be issued
> to the affected IOREQ server with the just invalidated BFN address in the data
> field.
>
> The BFN mappings cannot be simply unmapped at the point of the balloon hypercall
> otherwise a malicious guest could specifically balloon out a GFN address
> in use by an emulator and trigger IOMMU faults for the domains with BFN
> mappings.
>
> For hosts with no IOMMU support: The affected emulator(s) must specifically
> issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
> the references to the underlying MFN are removed and the MFN can be freed back
> to the Xen memory allocator.
I do not quite understand this. With no IOMMU support, these BFNs are
supplied by the hypervisor. So why not let the hypervisor do this unmap and
notify the calling domain?
>
> For hosts with IOMMU support:
> If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page then the affected emulator(s) must
> specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN
> address so that the references to the underlying MFN are removed.
>
> If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
> the BFN mapping will be swapped to point at a scratch MFN page and all BFN
> references to the invalid MFN will be removed by Xen after the BFN mapping has
> been updated to point at the scratch MFN page.
>
> The rationale for swapping the BFN mapping to point at scratch pages is to
> enable guest domains to balloon quickly without requiring hypercall(s) from
> emulators.
>
> Not all BFN mappings can be swapped without potentially causing problems for the
> hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
> allow per BFN control of Xen ballooning behaviour.
>
>
> PV IOMMU interactions with self ballooning
> ==========================================
>
> The guest should clear any IOMMU mappings it has of its own pages before
> releasing a page back to Xen. It will need to add IOMMU mappings after
> repopulating a page with the populate_physmap hypercall.
>
> This requires that IOMMU mappings get a writeable page type reference count and
> that guests clear any IOMMU mappings before pinning page table pages.
>
>
> Security Implications of allowing domain IOMMU control
> ===============================================================
>
> Xen currently allows IO devices attached to hardware domain to have direct
> access to all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
>
> The PV IOMMU feature provides the same level of access to MFN address space
> and the feature is not enabled when the Xen IOMMU option dom0-strict is
> enabled. Therefore security is not degraded by the PV IOMMU feature.
>
> Domains with physical device(s) assigned which are not hardware domains are only
> allowed to map their own GFNs or GFNs for domain(s) they have privilege over.
>
>
> PV IOMMU interactions with grant map/unmap operations
> =====================================================
>
> Grant map operations return a Physical device accessible address (BFN) if the
> GNTMAP_device_map flag is set.  This operation currently returns the MFN for PV
> guests which may conflict with the BFN address space the guest uses if PV IOMMU
> map support is available to the guest.
>
> This design proposes to allow the calling domain to control the BFN address that
> a grant map operation uses.
>
> This can be achieved by specifying that the dev_bus_addr in the
> gnttab_map_grant_ref structure is used as an input parameter instead of the
> output parameter it is currently.
>
> Only PAGE_SIZE aligned addresses are allowed for dev_bus_addr input parameter.
>
> The revised structure is shown below for convenience.
>
>      struct gnttab_map_grant_ref {
>          /* IN parameters. */
>          uint64_t host_addr;
>          uint32_t flags;               /* GNTMAP_* */
>          grant_ref_t ref;
>          domid_t  dom;
>          /* OUT parameters. */
>          int16_t  status;              /* => enum grant_status */
>          grant_handle_t handle;
>          /* IN/OUT parameters */
>          uint64_t dev_bus_addr;
>      };
>
>
> The grant map operation would then behave similarly to the IOMMUOP_map_page
> subop for the creation of the IOMMU mapping.
>
> The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
> subop for the removal of the IOMMU mapping.
>
> A new grantmap flag would be used to indicate the domain is requesting the
> dev_bus_addr field is used as an input parameter.
>
>
>      #define _GNTMAP_request_bfn_map      (6)
>      #define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)
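A rough sketch of how a caller could prepare a grant map request under the proposed flag, with dev_bus_addr supplied as an input and checked for page alignment. The struct layout and the new flag are copied from the draft; the typedef widths, the `GNTMAP_device_map` value, the 4 KiB `PAGE_SIZE` and the helper `prepare_bfn_grant_map` are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Assumed typedefs and page size, for a self-contained example. */
typedef uint32_t grant_ref_t;
typedef uint32_t grant_handle_t;
typedef uint16_t domid_t;
#define PAGE_SIZE 4096u

/* Assumed to match the existing Xen public header value. */
#define GNTMAP_device_map        (1 << 0)
/* New flag proposed in the draft. */
#define _GNTMAP_request_bfn_map  (6)
#define GNTMAP_request_bfn_map   (1 << _GNTMAP_request_bfn_map)

/* Revised structure as shown in the draft. */
struct gnttab_map_grant_ref {
    uint64_t host_addr;
    uint32_t flags;
    grant_ref_t ref;
    domid_t dom;
    int16_t status;
    grant_handle_t handle;
    uint64_t dev_bus_addr;  /* IN when GNTMAP_request_bfn_map is set */
};

/* Hypothetical helper: returns -1 if the requested bus address is not
 * PAGE_SIZE aligned, otherwise fills the request and returns 0. */
static int prepare_bfn_grant_map(struct gnttab_map_grant_ref *op,
                                 domid_t dom, grant_ref_t ref,
                                 uint64_t bus_addr)
{
    if (bus_addr & (PAGE_SIZE - 1))
        return -1;
    memset(op, 0, sizeof(*op));
    op->flags = GNTMAP_device_map | GNTMAP_request_bfn_map;
    op->ref = ref;
    op->dom = dom;
    op->dev_bus_addr = bus_addr;
    return 0;
}
```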
>
>
>
> Linux kernel architecture
> =========================
>
> The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
> space into the IOMMU. It will map the PFNs to the IOMMU address space using
> a 1:1 mapping. It does this by programming a BFN to GFN mapping which matches
> the PFN to GFN mapping.
>
> The native SWIOTLB will be used to handle devices which cannot DMA to all of
> the kernel's PFN address space.
>
> An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
> subops which will allow the Linux kernel to centrally manage that domain's BFN
> resource and ensure there are no unexpected conflicts.
>
>
> Emulator usage of PV IOMMU interface
> ====================================
>
> Emulators which require bus address mapping of guest RAM must first determine if
> it's possible for the domain to control the bus addresses themselves.
>
> An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
> flag is set then the emulator may specify the BFN address it wishes guest RAM to
> be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
> then the emulator must use BFN addresses supplied by Xen via the
> IOMMUOP_lookup_foreign_page.
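The decision described in the paragraph above could be sketched as follows: the emulator inspects the flags returned by IOMMUOP_query_caps and picks the foreign-page subop accordingly. The subop numbers and the capability bit follow the draft; `pick_foreign_subop` is a hypothetical helper.

```c
#include <stdint.h>

/* Subop numbers and capability bit as listed in the draft. */
#define IOMMUOP_map_foreign_page     4
#define IOMMUOP_lookup_foreign_page  5
#define IOMMU_QUERY_map_cap          (1u << 0)

/* Hypothetical helper: if the domain may choose its own BFNs, use
 * map_foreign_page; otherwise fall back to the BFN supplied by Xen via
 * lookup_foreign_page. */
static unsigned int pick_foreign_subop(uint16_t caps_flags)
{
    return (caps_flags & IOMMU_QUERY_map_cap)
               ? IOMMUOP_map_foreign_page
               : IOMMUOP_lookup_foreign_page;
}
```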
>
> Operating systems which use the IOMMUOP_map_page subop are expected to provide a
> common interface for emulators
According to our previous internal discussions, my understanding about
the usage is this:
1> PV IOMMU has an interface in dom0's kernel to do the query/map/lookup
all at once, which also includes the BFN allocation algorithm.
2> When XenGT emulator tries to construct a shadow PTE, we can just call
your interface, which returns a BFN whatever.
However, the above description seems to say that the XenGT device model needs to do
the query/lookup/map by itself?
Besides, could you please give more detailed information about this
'common interface'? :)
Thanks
Yu
>
> Emulators should unmap unused GFN mappings as often as possible using
> IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
> quickly and efficiently.
>
> Emulators should conform to the ballooning behaviour described section
> "IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
> domains are able to effectively balloon out and in memory.
>
> Emulators must unmap any active BFN mappings when they shutdown.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
- * Re: [RFC] Xen PV IOMMU interface draft B
  2015-06-17 12:48 ` Yu, Zhang
@ 2015-06-17 13:34   ` Jan Beulich
  2015-06-17 13:44   ` Malcolm Crossley
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Beulich @ 2015-06-17 13:34 UTC (permalink / raw)
  To: Zhang Yu
  Cc: Kevin Tian, Andrew Cooper, Paul Durrant, David Vrabel, xen-devel,
	Malcolm Crossley, Zhiyuan Lv
>>> On 17.06.15 at 14:48, <yu.c.zhang@linux.intel.com> wrote:
>    Thank you very much for accommodate our XenGT requirement in your
> design. Following are some XenGT related questions. :)
Please trim your replies.
Jan
- * Re: [RFC] Xen PV IOMMU interface draft B
  2015-06-17 12:48 ` Yu, Zhang
  2015-06-17 13:34   ` Jan Beulich
@ 2015-06-17 13:44   ` Malcolm Crossley
  1 sibling, 0 replies; 20+ messages in thread
From: Malcolm Crossley @ 2015-06-17 13:44 UTC (permalink / raw)
  To: Yu, Zhang, xen-devel, Jan Beulich, Konrad Rzeszutek Wilk,
	Andrew Cooper, Paul Durrant, Kevin Tian, Lv, Zhiyuan,
	David Vrabel
On 17/06/15 13:48, Yu, Zhang wrote:
> Hi Malcolm,
> 
>   Thank you very much for accommodate our XenGT requirement in your
> design. Following are some XenGT related questions. :)
> 
> On 6/13/2015 12:43 AM, Malcolm Crossley wrote:
>> Hi All,
<snip>
>>
>> IOMMUOP_map_foreign_page
>> ----------------
>> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
>>
>> It is not valid to use domid representing the calling domain.
>>
>> The hypercall will only succeed if calling domain has sufficient privilege over
>> the specified domid
>>
>> If there is no IOMMU support then the MFN is returned in the BFN field (that is
>> the only valid bus address for the GFN + domid combination).
>>
>> If there IOMMU support then the specified BFN is returned for the GFN + domid
>> combination
>>
>> The M2B mechanism is a MFN to (BFN,domid,ioserver) tuple.
>>
>> Each successful subop will add to the M2B if there was not an existing identical
>> M2B entry.
>>
>> Every new M2B entry will take a reference to the MFN backing the GFN.
>>
>> All the following conditions are required to be true for PV IOMMU map_foreign
>> subop to succeed:
>>
>> 1. IOMMU detected and supported by Xen
>> 2. The domain has IOMMU controlled hardware allocated to it
>> 3. The domain is a hardware_domain and the following Xen IOMMU options are
>>     NOT enabled: dom0-passthrough
> What if the IOMMU is enabled, and runs in the default mode, which 1:1 maps all memories except owned
> by Xen?
Good question. A PV IOMMU aware guest will know the 1:1 map exists and can use the
IOMMUOP_unmap_page subop to remove any mappings which will conflict with its planned BFN mappings.
For a PV IOMMU unaware guest I think the IOMMUOP_lookup_foreign_page subop should be used instead. This
will allow the IOSERVER to register interest in the Domid + GFN it's using and allow ballooning to
be used.
FYI, the 1:1 map on PV guests will be set up without taking a reference to the MFN, otherwise unaware
PV guests will be unable to create page tables.
>>
>>
>> This subop usage of the "struct pv_iommu_op" and ``struct map_foreign_page`
>> fields are detailed below:
>>
>> --------------------------------------------------------------------
>> Field          Purpose
>> -----          -----------------------------------------------------
>> `domid`        [in] The domain ID for which the gfn field applies
>>
>> `ioserver`     [in] IOREQ server id associated with mapping
>>
>> `bfn`          [in] Bus address frame number for gfn address
>>
>> `gfn`          [in] Guest address frame number
>>
>> `flags`        [in] Details the status of the BFN mapping
>>
>> `status`       [out] status of this subop, 0 indicates success
>> --------------------------------------------------------------------
>>
>> Defined bits for flags field:
>>
>> Name                         Bit                Definition
>> ----                        -----      ----------------------------------
>> IOMMUOP_readable              0        BFN IOMMU mapping is readable
>> IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
>> IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
>>                                         swapped to scratch page
>> Reserved for future use      3-9       Reserved flag bits should be 0
>> IOMMU_page_order            10-15      Returns maximum possible page order for
>>                                         all other IOMMUOP subops
>>
>> Defined values for map_foreign_page subop status field:
>>
>> Error code  Reason
>> ----------  ------------------------------------------------------------
>> 0            subop successfully returned
>> -EIO         IOMMU unit returned error when attempting to map BFN to GFN.
>> -EPERM       Calling domain does not have sufficient privilege over domid
>> -EPERM       GFN could not be mapped because the GFN belongs to Xen.
>> -EPERM       domid maps to DOMID_SELF
>> -EACCES      BFN address conflicts with RMRR regions for device's attached to
>>               DOMID_SELF
>> -ENODEV      Provided ioserver id is not valid
>> -ENXIO       Provided domid id is not valid
>> -ENXIO       Provided GFN address is not valid
>> -ENOSPC      Page order is too large for either BFN, GFN or IOMMU unit
>>
>> IOMMU_lookup_foreign_page
>> ----------------
>> This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
>>
>> If the BFN is specified as an input and parameter and there is no IOMMU support
>> for the calling domain then an error will be returned.
>>
>> It is the calling domain responsibility to ensure there are no conflicts
>>
>> The hypercall will only succeed if calling domain has sufficient privilege over
>> the specified domid
>>
>> If there is no IOMMU support then the MFN is returned in the BFN field (that is
>> the only valid bus address for the GFN + domid combination).
> Similarly, what if the IOMMU is enabled, and runs in the default mode,
> which 1:1 maps all memories except owned by Xen? Will a MFN be returned?
> Or should we take the query/map ops instead of the lookup op for this
> situation?
The lookup will return the BFN which is 1:1 mapped to the MFN.
Only the hardware domain will have precreated BFN mappings of other domains' memory.
So the logic could look like this:
If dom0, then the lookup uses the P2M to get the MFN, then uses the M2B to look up the BFN. If this
fails, then check if the BFN is mapped to the MFN 1:1; if so,
return the BFN, else return -ENOENT.
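The lookup order described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `struct toy_state` and the helpers `p2m_lookup`, `m2b_lookup` and `bfn_is_identity_mapped` are hypothetical stand-ins for Xen's real P2M, M2B and IOMMU page table state, and returning -ENXIO for an unknown GFN is borrowed from the map_foreign_page status table rather than stated for the lookup subop.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_FN UINT64_MAX

/* Hypothetical toy tables standing in for Xen's P2M, M2B and IOMMU state. */
struct p2m_entry { uint16_t domid; uint64_t gfn; uint64_t mfn; };
struct m2b_entry { uint64_t mfn; uint64_t bfn; };

struct toy_state {
    const struct p2m_entry *p2m;
    size_t nr_p2m;
    const struct m2b_entry *m2b;
    size_t nr_m2b;
    const uint64_t *identity_bfns;   /* BFNs currently mapped 1:1 to the same MFN */
    size_t nr_identity;
};

/* P2M step: translate (domid, GFN) to an MFN. */
static uint64_t p2m_lookup(const struct toy_state *s, uint16_t domid, uint64_t gfn)
{
    for (size_t i = 0; i < s->nr_p2m; i++)
        if (s->p2m[i].domid == domid && s->p2m[i].gfn == gfn)
            return s->p2m[i].mfn;
    return INVALID_FN;
}

/* M2B step: find a BFN already recorded for this MFN. */
static uint64_t m2b_lookup(const struct toy_state *s, uint64_t mfn)
{
    for (size_t i = 0; i < s->nr_m2b; i++)
        if (s->m2b[i].mfn == mfn)
            return s->m2b[i].bfn;
    return INVALID_FN;
}

/* Fallback step: is the BFN equal to this MFN present in the 1:1 map? */
static bool bfn_is_identity_mapped(const struct toy_state *s, uint64_t mfn)
{
    for (size_t i = 0; i < s->nr_identity; i++)
        if (s->identity_bfns[i] == mfn)
            return true;
    return false;
}

/* Returns 0 and fills *bfn on success, or a negative errno value. */
int lookup_foreign_bfn(const struct toy_state *s, uint16_t domid,
                       uint64_t gfn, uint64_t *bfn)
{
    assert(s != NULL && bfn != NULL);

    uint64_t mfn = p2m_lookup(s, domid, gfn);
    if (mfn == INVALID_FN)
        return -ENXIO;               /* assumption: invalid GFN, as for map_foreign_page */

    uint64_t found = m2b_lookup(s, mfn);
    if (found != INVALID_FN) {
        *bfn = found;                /* an M2B entry already exists */
        return 0;
    }

    if (bfn_is_identity_mapped(s, mfn)) {
        *bfn = mfn;                  /* precreated 1:1 mapping: BFN == MFN */
        return 0;
    }

    return -ENOENT;
}
```

A caller would pass the foreign domid and GFN and treat -ENOENT as "no usable bus address exists yet for this page".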
>>
>> Each successful subop will add to the M2B if there was not an existing identical
>> M2B entry.
>>
>> Every new M2B entry will take a reference to the MFN backing the GFN.
>>
<snip>
>>
>> IOMMUOP_*_foreign_page interactions with guest domain ballooning
>> ================================================================
>>
>> Guest domains can balloon out a set of GFN mappings at any time and render the
>> BFN to GFN mapping invalid.
>>
>> When a BFN to GFN mapping becomes invalid, Xen will issue a buffered IO request
>> of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
>> BFN address in the data field. If the buffered IO request ring is full then a
>> standard (synchronous) IO request of type IOREQ_TYPE_INVALIDATE will be issued
>> to the affected IOREQ server the with just invalidated BFN address in the data
>> field.
>>
>> The BFN mappings cannot be simply unmapped at the point of the balloon hypercall
>> otherwise a malicious guest could specifically balloon out an in use GFN address
>> in use by an emulator and trigger IOMMU faults for the domains with BFN
>> mappings.
>>
>> For hosts with no IOMMU support: The affected emulator(s) must specifically
>> issue a IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
>> the references to the underlying MFN are removed and the MFN can be freed back
>> to the Xen memory allocator.
> I do not quite understand this. With no IOMMU support, these BFNs are
> supplied by hypervisor. So why not let hypervisor do this unmap and
> notify the calling domain?
We need the emulators to do the unmap so that they can ensure that hardware is not actively using
the BFN (same as MFN in this case) otherwise Xen may allocate that MFN to another guest and that
guest will have its memory corrupted.
Another way to think about it is that a malicious guest could set up a long running DMA to its RAM
and then deliberately balloon out that RAM whilst the DMA is running. The only way to secure that
scenario is to not let the ballooned out RAM be used until the emulator confirms it's safe to do so.
The IOMMUOP_swap_mfn optimisation has been added to allow Xen to drop references safely.
Unfortunately it requires the IOMMU to be enabled.
>>
<snip>
>> Emulator usage of PV IOMMU interface
>> ====================================
>>
>> Emulators which require bus address mapping of guest RAM must first determine if
>> it's possible for the domain to control the bus addresses themselves.
>>
>> A IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
>> flag is set then the emulator may specify the BFN address it wishes guest RAM to
>> be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
>> then the emulator must use BFN addresses supplied by the Xen via the
>> IOMMUOP_lookup_foreign_page.
>>
>> Operating systems which use the IOMMUOP_map_page subop are expected to provide a
>> common interface for emulators
> 
> According to our previous internal discussions, my understanding about
> the usage is this:
> 1> PV IOMMU has an interface in dom0's kernel to do the query/map/lookup
> all at once, which also includes the BFN allocation algorithm.
> 2> When XenGT emulator tries to construct a shadow PTE, we can just call
> your interface, which returns a BFN whatever.
> 
> However, the above description seems the XenGT device model need to do
> the query/lookup/map by itself?
The above description is to cover emulators which may run in their own domain (stub domain).
> Besides, could you please give a more detailed information about this
> 'common interface'? :)
I will try to include more details in the next draft.
My current thinking is to reuse the "struct pv_iommu_op" array of ops and just implement a common
function for requesting a BFN mapping. The common function will fill in the subOp_field for the caller.
Thanks for your feedback and please trim your replies as Jan suggested. It makes it much easier to
find and reply to your inline comments.
> 
> Thanks
> Yu
>>
 
- * Xen PV IOMMU interface draft C
  2015-06-12 16:43 [RFC] Xen PV IOMMU interface draft B Malcolm Crossley
  2015-06-16 13:19 ` Jan Beulich
  2015-06-17 12:48 ` Yu, Zhang
@ 2015-06-26 10:23 ` Malcolm Crossley
  2015-06-26 11:03   ` Ian Campbell
                     ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Malcolm Crossley @ 2015-06-26 10:23 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, Konrad Rzeszutek Wilk, Andrew Cooper,
	Paul Durrant, Kevin Tian, Lv, Zhiyuan, Zhang, Yu C, David Vrabel,
	Ian Campbell
Hi All,
Here is a design for allowing guests to control the IOMMU. This
allows for the guest GFN mapping to be programmed into the IOMMU and
avoids using the SWIOTLB bounce buffer technique in the Linux kernel
(except for legacy 32 bit DMA IO devices).
Draft C has been reordered to explain expected behaviours before the APIs
themselves. There's an additional section to explain the rationale for
separate subops for local GFN mappings and foreign GFN mappings.
There's also further detail on the Linux API for foreign BFN mappings.
The plan is to start writing code against this version so please provide
feedback on any major design problems/concerns.
The pandoc markdown format of the document is provided below to allow
for easier inline comments:
% Xen PV IOMMU interface
% Malcolm Crossley <<malcolm.crossley@citrix.com>>
  Paul Durrant <<paul.durrant@citrix.com>>
% Draft C
Introduction
============
Revision History
----------------
--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft A  10 Apr 2014  Initial draft.
Draft B  12 Jun 2015  Second draft.
Draft C  26 Jun 2015  Third draft.
--------------------------------------------------------------------
Background
==========
Linux kernel SWIOTLB
--------------------
Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is
decoupled from the host Machine Frame Number (MFN) address space.
PV guest hardware drivers are aware of the PFN address space only and
assume that if PFN addresses are contiguous then the hardware addresses would
be contiguous as well. The decoupling between PFN and MFN address spaces means
PFN and MFN addresses may not be contiguous across page boundaries and thus a
buffer allocated in PFN address space which spans a page boundary may not be
contiguous in MFN address space.
PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.
A bounce buffer region is a special part of the PFN address space which has
been made to be contiguous in both PFN and MFN address spaces. When a driver
requests that a buffer which spans a page boundary be made available for hardware
to read, the core operating system code copies the buffer into a temporarily
reserved part of the bounce buffer region and then returns the MFN address of
the reserved part of the bounce buffer region back to the driver itself. The
driver then instructs the hardware to read the copy of the buffer in the
bounce buffer. Similarly, if the driver requests that a buffer be made available
for hardware to write to, then first a region of the bounce buffer is reserved
and, after the hardware completes writing, the reserved region of the
bounce buffer is copied to the originally allocated buffer.
The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk the fixed size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do not
tolerate this failure and so the kernel is forced to crash, as an
unrecoverable error has occurred.
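To make the copy overhead and the exhaustion case concrete, here is a deliberately simplified sketch of the bounce buffer pattern. It is not the Linux SWIOTLB code: `struct bounce_region` and the `bounce_*` helpers are hypothetical, the region is a plain byte array, and slots are never freed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096

/* Hypothetical fixed-size bounce region; a real SWIOTLB manages many slots. */
struct bounce_region {
    unsigned char mem[BOUNCE_SIZE];
    size_t used;
};

/* Reserve len bytes; returns NULL once the fixed-size region is exhausted. */
static void *bounce_reserve(struct bounce_region *r, size_t len)
{
    if (len > BOUNCE_SIZE - r->used)
        return NULL;
    void *slot = r->mem + r->used;
    r->used += len;
    return slot;
}

/* Hardware will read: copy the driver's buffer into the region before DMA. */
void *bounce_map_to_device(struct bounce_region *r, const void *buf, size_t len)
{
    assert(r != NULL && buf != NULL);
    void *slot = bounce_reserve(r, len);
    if (slot != NULL)
        memcpy(slot, buf, len);
    return slot;
}

/* Hardware will write: only reserve now, the copy happens after DMA. */
void *bounce_map_from_device(struct bounce_region *r, size_t len)
{
    assert(r != NULL);
    return bounce_reserve(r, len);
}

/* After the hardware finished writing: copy back to the original buffer. */
void bounce_unmap_from_device(void *buf, const void *slot, size_t len)
{
    memcpy(buf, slot, len);
}
```

Every transfer costs one extra `memcpy`, and a NULL return from the reserve step corresponds to the exhaustion failure that drivers cannot tolerate.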
Input/Output Memory Management Units (IOMMU) allow for an inbound address
mapping to be created from the I/O Bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and therefore can create mappings of page size
granularity or larger.
The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
address space for the rest of this document.
Mediated Pass-through Emulators
-------------------------------
Mediated Pass-through emulators allow guest domains to interact with
hardware devices via emulator mediation. The emulator runs in a domain separate
to the guest domain and it is used to enforce security of guest access to the
hardware devices and isolation of different guests accessing the same hardware
device.
The emulator requires a mechanism to map guest addresses to a bus address that
the hardware devices can access.
Clarification of GFN and BFN fields for different guest types
-------------------------------------------------------------
The definition of Guest Frame Number (GFN) varies depending on the guest type.
The diagram below details the memory accesses originating from the CPU, per guest type:
      HVM guest                              PV guest
         (VA)                                   (VA)
          |                                      |
         MMU                                    MMU
          |                                      |
         (GFN)                                   |
          |                                      | (GFN)
     HAP a.k.a EPT/NPT                           |
          |                                      |
         (MFN)                                  (MFN)
          |                                      |
         RAM                                    RAM
For PV guests GFN is equal to MFN for a single page but not for a contiguous
range of pages.
Bus Frame Numbers (BFN) refer to the address presented on the physical bus
before being translated by the IOMMU.
The diagram below details memory accesses originating from a physical device.
    Physical Device
          |
        (BFN)
          |
       IOMMU-PT
          |
        (MFN)
          |
         RAM
Purpose
=======
1. Allow Xen guests to create/modify/destroy IOMMU mappings for
hardware devices that the PV guest has access to. This enables the PV guest to
program a bus address space mapping which matches its GFN mapping. Once a 1-1
mapping of PFN to bus address space is created then a bounce buffer
region is not required for the I/O devices connected to the IOMMU.
2. Allow for Xen guests to lookup/create/modify/destroy IOMMU mappings for
guest memory of domains the calling Xen guest has sufficient privilege over.
This enables domains to provide mediated hardware acceleration to other
guest domains.
General principles for PV IOMMU interface
=========================================
There are two different usage models for the BFN address space of a calling
guest based upon the two purposes specified in the section above.
A calling guest may use their BFN address space for only one of the purposes
detailed above and so the PV IOMMU interface has a subop per usage model.
Furthermore, the IOMMU mapping of foreign domain memory is more complex than
the IOMMU mapping of local domain memory and separating the subops allows for the
complexity to be split in the implementation.
The PV IOMMU design allows the calling domain to control its BFN memory map.
Thus the design also assigns the calling domain the responsibility of ensuring that a BFN address
mapped for local domain memory is not reused for a foreign domain
memory mapping without an explicit unmap of the BFN address first. This simplifies
the usage of the API and the extra overhead for the calling domains should be
minimal as they should already be tracking the BFN address space usage.
Emulator usage of PV IOMMU interface
====================================
Emulators which require bus address mapping of guest RAM must first determine if
it's possible for the domain to control the bus addresses themselves.
An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
flag is set then the emulator may specify the BFN address it wishes guest RAM to
be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
then the emulator must use BFN addresses supplied by Xen via the
IOMMUOP_lookup_foreign_page subop.
Operating systems which use the IOMMUOP_map_page subop are expected to provide a
common interface for emulators to use. Otherwise emulators will not be aware
of existing BFN mappings created by the operating system and will get failed
subops due to conflicts in the BFN address space for the domain.
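The choice between the two foreign subops might be sketched as below. This is a minimal sketch, not part of the proposed interface: the helper `prepare_foreign_op` is hypothetical, the structure is a trimmed local copy of `struct pv_iommu_op`, `ioservid_t` is assumed to be a 16-bit integer, and leaving `bfn` as zero for the lookup case is a placeholder because the text does not say how an unspecified BFN is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ioservid_t;   /* assumption: 16-bit, as in Xen's public headers */

#define IOMMUOP_map_foreign_page      4
#define IOMMUOP_lookup_foreign_page   5
#define IOMMU_QUERY_map_cap           (1u << 0)   /* bit 0 of the query_caps flags */

/* Trimmed copy of struct pv_iommu_op: only the foreign members used here. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t status;
    union {
        struct {
            uint64_t bfn;
            uint64_t gfn;
            uint16_t domid;
            ioservid_t ioserver;
        } map_foreign_page;
        struct {
            uint64_t bfn;
            uint64_t gfn;
            uint16_t domid;
            ioservid_t ioserver;
        } lookup_foreign_page;
    } u;
};

/* Hypothetical helper: fill one op depending on the queried capabilities. */
void prepare_foreign_op(struct pv_iommu_op *op, uint16_t caps,
                        uint16_t domid, ioservid_t ioserver,
                        uint64_t gfn, uint64_t wanted_bfn)
{
    assert(op != NULL);
    memset(op, 0, sizeof(*op));

    if (caps & IOMMU_QUERY_map_cap) {
        /* The domain controls its bus addresses: choose the BFN ourselves. */
        op->subop_id = IOMMUOP_map_foreign_page;
        op->u.map_foreign_page.bfn = wanted_bfn;
        op->u.map_foreign_page.gfn = gfn;
        op->u.map_foreign_page.domid = domid;
        op->u.map_foreign_page.ioserver = ioserver;
    } else {
        /* No control: ask Xen which BFN to use (bfn left as a placeholder). */
        op->subop_id = IOMMUOP_lookup_foreign_page;
        op->u.lookup_foreign_page.gfn = gfn;
        op->u.lookup_foreign_page.domid = domid;
        op->u.lookup_foreign_page.ioserver = ioserver;
    }
}
```

The `caps` value would come from the `flags` field returned by a prior IOMMUOP_query_caps subop.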
Emulators should unmap unused GFN mappings as often as possible using
IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
quickly and efficiently.
Emulators should conform to the ballooning behaviour described in the section
"IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
domains are able to effectively balloon out and in memory.
Emulators must unmap any active BFN mappings when they shut down.
IOMMUOP_*_foreign_page interactions with guest domain ballooning
================================================================
Guest domains can balloon out a set of GFN mappings at any time and render the
BFN to GFN mapping invalid.
When a BFN to GFN mapping becomes invalid, Xen will issue a buffered I/O request
of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
BFN address in the data field. If the buffered I/O request ring is full then a
standard (synchronous) I/O request of type IOREQ_TYPE_INVALIDATE will be issued
to the affected IOREQ server with the just invalidated BFN address in the data
field.
The BFN mappings cannot be simply unmapped at the point of the balloon hypercall
otherwise a malicious guest could specifically balloon out a GFN address
in use by an emulator and trigger IOMMU faults for the domains with BFN
mappings.
For hosts with no IOMMU support: The affected emulator(s) must specifically
issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
the references to the underlying MFN are removed and the MFN can be freed back
to the Xen memory allocator.
For hosts with IOMMU support:
If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop then the affected emulator(s) must
specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN
address so that the references to the underlying MFN are removed.
If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
the BFN mapping will be swapped to point at a scratch MFN page and all BFN
references to the invalid MFN will be removed by Xen after the BFN mapping has
been updated to point at the scratch MFN page.
The rationale for swapping the BFN mapping to point at scratch pages is to
enable guest domains to balloon quickly without requiring hypercall(s) from
emulators.
Not all BFN mappings can be swapped without potentially causing problems for the
hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
allow per BFN control of Xen ballooning behaviour.
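The ballooning rules in this section reduce to a small decision, sketched below. The names `struct foreign_mapping`, `enum balloon_action` and `balloon_out_action` are hypothetical and only illustrate the rules as written; the bit position of IOMMUOP_swap_mfn (bit 2) is taken from the map_foreign_page flags table in draft B, and `ioservid_t` is assumed to be a 16-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ioservid_t;          /* assumption: 16-bit IOREQ server id */

#define IOMMUOP_swap_mfn (1u << 2)    /* bit 2 of the map_foreign_page flags */

/* Hypothetical record of one emulator's BFN mapping of the ballooned GFN. */
struct foreign_mapping {
    ioservid_t ioserver;
    uint64_t bfn;
    uint16_t flags;
};

enum balloon_action {
    BALLOON_NOTHING_TO_DO,      /* no emulator has the GFN mapped */
    BALLOON_SWAP_TO_SCRATCH,    /* Xen repoints the BFNs and drops the references */
    BALLOON_WAIT_FOR_UNMAP      /* emulators must issue IOMMUOP_unmap_foreign_page */
};

/* Decide how a balloon-out of one GFN is handled for its foreign mappings. */
enum balloon_action balloon_out_action(bool host_has_iommu,
                                       const struct foreign_mapping *maps,
                                       size_t nr)
{
    assert(maps != NULL || nr == 0);

    if (nr == 0)
        return BALLOON_NOTHING_TO_DO;

    /* Without an IOMMU there is no mapping that could be swapped. */
    if (!host_has_iommu)
        return BALLOON_WAIT_FOR_UNMAP;

    /* Swapping is only allowed if every emulator mapping opted in. */
    for (size_t i = 0; i < nr; i++)
        if (!(maps[i].flags & IOMMUOP_swap_mfn))
            return BALLOON_WAIT_FOR_UNMAP;

    return BALLOON_SWAP_TO_SCRATCH;
}
```

In the wait case Xen would still send IOREQ_TYPE_INVALIDATE to each affected IOREQ server, as described earlier in this section.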
PV IOMMU interactions with self ballooning
==========================================
A guest should clear any IOMMU mappings it has of its own pages before
releasing a page back to Xen. The guest also will need to add IOMMU mappings
after repopulating a page with the populate_physmap hypercall.
PV guests must clear any IOMMU mappings before pinning page table pages
because the IOMMU mappings will take a writable reference count and this will
prevent page table pinning.
Security Implications of allowing domain IOMMU control
======================================================
Xen currently allows I/O devices attached to the hardware domain to have direct
access to all of the MFN address space (except Xen hypervisor memory regions),
provided the Xen IOMMU option dom0-strict is not enabled.
The PV IOMMU feature provides the same level of access to MFN address space
and the feature is not enabled when the Xen IOMMU option dom0-strict is
enabled. Therefore security is not degraded by the PV IOMMU feature.
Domains with physical device(s) assigned which are not hardware domains are only
allowed to map their own GFNs or GFNs for domain(s) they have privilege over.
PV IOMMU interactions with grant map/unmap operations
=====================================================
Grant map operations return a Physical device accessible address (BFN) if the
GNTMAP_device_map flag is set.  This operation currently returns the MFN for PV
guests which may conflict with the BFN address space the guest uses if PV IOMMU
map support is available to the guest.
This design proposes to allow the calling domain to control the BFN address that
a grant map operation uses.
This can be achieved by specifying that the dev_bus_addr in the
gnttab_map_grant_ref structure is used as an input parameter instead of the
output parameter it currently is.
Only PAGE_SIZE aligned addresses are allowed for dev_bus_addr input parameter.
The revised structure is shown below for convenience.
    struct gnttab_map_grant_ref {
        /* IN parameters. */
        uint64_t host_addr;
        uint32_t flags;               /* GNTMAP_* */
        grant_ref_t ref;
        domid_t  dom;
        /* OUT parameters. */
        int16_t  status;              /* => enum grant_status */
        grant_handle_t handle;
        /* IN/OUT parameters */
        uint64_t dev_bus_addr;
    };
The grant map operation would then behave similarly to the IOMMUOP_map_page
subop for the creation of the IOMMU mapping.
The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
subop for the removal of the IOMMU mapping.
A new grantmap flag would be used to indicate the domain is requesting that the
dev_bus_addr field be used as an input parameter.
    #define _GNTMAP_request_bfn_map      (6)
    #define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)
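A caller requesting a specific bus address might fill the structure as sketched here. This is an illustration only: `prepare_bfn_grant_map` is a hypothetical helper, the typedefs and `GNTMAP_device_map` are local copies mirroring the existing grant table interface, `BFN_PAGE_SIZE` assumes 4 KiB pages, and returning -EINVAL for an unaligned address is an assumption, since no error code is specified for that case.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the grant table types, for a self-contained sketch. */
typedef uint32_t grant_ref_t;
typedef uint32_t grant_handle_t;
typedef uint16_t domid_t;

struct gnttab_map_grant_ref {
    /* IN parameters. */
    uint64_t host_addr;
    uint32_t flags;               /* GNTMAP_* */
    grant_ref_t ref;
    domid_t  dom;
    /* OUT parameters. */
    int16_t  status;
    grant_handle_t handle;
    /* IN/OUT parameters */
    uint64_t dev_bus_addr;
};

#define _GNTMAP_device_map       (0)
#define GNTMAP_device_map        (1 << _GNTMAP_device_map)
#define _GNTMAP_request_bfn_map  (6)
#define GNTMAP_request_bfn_map   (1 << _GNTMAP_request_bfn_map)

#define BFN_PAGE_SIZE 4096u       /* assumption: 4 KiB pages */

/* Hypothetical helper: request that the grant be mapped at bus_addr. */
int prepare_bfn_grant_map(struct gnttab_map_grant_ref *op, domid_t dom,
                          grant_ref_t ref, uint64_t bus_addr)
{
    assert(op != NULL);

    /* Only page aligned addresses are allowed for the input parameter. */
    if (bus_addr & (BFN_PAGE_SIZE - 1))
        return -EINVAL;           /* assumption: error code not specified */

    memset(op, 0, sizeof(*op));
    op->flags = GNTMAP_device_map | GNTMAP_request_bfn_map;
    op->ref = ref;
    op->dom = dom;
    op->dev_bus_addr = bus_addr;  /* now an input, not only an output */
    return 0;
}
```

Without GNTMAP_request_bfn_map the field keeps its current meaning as an output only, so existing callers are unaffected.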
Xen PV-IOMMU Architecture
=========================
The Xen architecture consists of a new hypercall interface and changes to the
grant map interface.
The existing IOMMU mappings set up at domain creation time will be preserved so
that PV domains unaware of this feature will continue to function with no
changes required.
Memory ballooning will be supported by taking an additional reference on the
MFN backing the GFN for each successful IOMMU mapping created.
An M2B tracking structure will be used to ensure all references to an MFN can
be located efficiently.
Xen PV IOMMU hypercall interface
--------------------------------
A two argument hypercall interface (do_iommu_op).
    ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
First argument, guest handle pointer to array of `struct pv_iommu_op`
Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
Definition of `struct pv_iommu_op`:
    struct pv_iommu_op {
        uint16_t subop_id;
        uint16_t flags;
        int32_t status;
        union {
            struct {
                uint64_t bfn;
                uint64_t gfn;
            } map_page;
            struct {
                uint64_t bfn;
            } unmap_page;
            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } map_foreign_page;
            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } lookup_foreign_page;
            struct {
                uint64_t bfn;
                ioservid_t ioserver;
            } unmap_foreign_page;
        } u;
    };
Definition of PV IOMMU subops:
    #define IOMMUOP_query_caps            1
    #define IOMMUOP_map_page              2
    #define IOMMUOP_unmap_page            3
    #define IOMMUOP_map_foreign_page      4
    #define IOMMUOP_lookup_foreign_page   5
    #define IOMMUOP_unmap_foreign_page    6
Design considerations for hypercall op
-------------------------------------------
IOMMU map/unmap operations can be slow and can involve flushing the IOMMU TLB
to ensure the I/O device uses the updated mappings.
The op has been designed to take an array of operations and a count as
parameters. This allows for easily implemented hypercall continuations to be
used and allows for batches of IOMMU operations to be submitted before flushing
the IOMMU TLB.
The `subop_id` to be used for a particular element is encoded into the element
itself. This allows for map and unmap operations to be performed in one hypercall
and for the IOMMU TLB flushing optimisations to be still applied.
The hypercall will ensure that the required IOMMU TLB flushes are applied before
returning to guest via either hypercall completion or a hypercall continuation.
IOMMUOP_query_caps
------------------
This subop queries the runtime capabilities of the PV-IOMMU interface for the
specific calling domain. This subop uses `struct pv_iommu_op` directly.
------------------------------------------------------------------------------
Field          Purpose
-----          ---------------------------------------------------------------
`flags`        [out] This field details the IOMMUOP capabilities.
`status`       [out] Status of this op, op specific values listed below
------------------------------------------------------------------------------
Defined bits for flags field:
------------------------------------------------------------------------------
Name                        Bit                Definition
----                       ------     ----------------------------------
IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or IOMMUOP_map_foreign
                                      can be used for the calling domain
IOMMU_QUERY_map_all_mfns     1        IOMMUOP_map_page subop can map any MFN
                                      not used by Xen
Reserved for future use     2-9                   n/a
IOMMU_page_order           10-15      Returns maximum possible page order for
                                      all other IOMMUOP subops
------------------------------------------------------------------------------
Defined values for query_caps subop status field:
Value   Reason
------  ----------------------------------------------------------
0       subop successfully returned
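Decoding the returned `flags` field might look like the sketch below. The bit positions follow the table above; `struct iommu_caps`, `decode_query_caps` and the `IOMMU_PAGE_ORDER_*` macro names are hypothetical, introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOMMU_QUERY_map_cap        (1u << 0)
#define IOMMU_QUERY_map_all_mfns   (1u << 1)
#define IOMMU_PAGE_ORDER_SHIFT     10       /* hypothetical name: first order bit */
#define IOMMU_PAGE_ORDER_MASK      0x3fu    /* hypothetical name: bits 10-15 */

/* Hypothetical decoded view of the query_caps flags field. */
struct iommu_caps {
    bool can_map;                 /* map_page / map_foreign_page usable */
    bool can_map_all_mfns;        /* any MFN not used by Xen may be mapped */
    unsigned int max_page_order;  /* maximum page order for other subops */
};

struct iommu_caps decode_query_caps(uint16_t flags)
{
    struct iommu_caps caps = {
        .can_map = (flags & IOMMU_QUERY_map_cap) != 0,
        .can_map_all_mfns = (flags & IOMMU_QUERY_map_all_mfns) != 0,
        .max_page_order = (flags >> IOMMU_PAGE_ORDER_SHIFT) & IOMMU_PAGE_ORDER_MASK,
    };
    return caps;
}
```

The reserved bits 2-9 are ignored here, so the sketch keeps working if later drafts assign them.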
IOMMUOP_map_page
----------------------
This subop uses `struct map_page` part of the `struct pv_iommu_op`.
If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs except for Xen owned MFNs else the hardware
domain will only be allowed to map GFNs which it owns.
If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs without taking a reference to the MFN backing the GFN
by setting the IOMMU_MAP_OP_no_ref_cnt flag.
Every successful pv_iommu_op will result in an additional page reference being
taken on the MFN backing the GFN except for the condition detailed above.
If the map_op flags indicate a writeable mapping is required then a writeable
page type reference will be taken otherwise a standard page reference will be
taken.
All the following conditions are required to be true for PV IOMMU map
subop to succeed:
1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. If the domain is the hardware domain, the following Xen IOMMU option is
   NOT enabled: dom0-passthrough
This subop's usage of the `struct pv_iommu_op` and `struct map_page` fields
is detailed below:
------------------------------------------------------------------------------
Field          Purpose
-----          ---------------------------------------------------------------
`bfn`          [in]  Bus address frame number(BFN) to be mapped to specified gfn
                     below
`gfn`          [in]  Guest address frame number for DOMID_SELF
`flags`        [in]  Flags for signalling type of IOMMU mapping to be created,
                     Flags can be combined.
`status`       [out] Mapping status of this op, op specific values listed below
------------------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                       -----      ----------------------------------
IOMMU_OP_readable            0        Create readable IOMMU mapping
IOMMU_OP_writeable           1        Create writeable IOMMU mapping
IOMMU_MAP_OP_no_ref_cnt      2        IOMMU mapping does not take a reference to
                                      MFN backing BFN mapping
Reserved for future use     3-9                   n/a
IOMMU_page_order            10-15     Page order to be used for both gfn and bfn
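As an illustration of the flags layout above, the following is a minimal sketch of how a caller might pack the access bits and the page order into the 16 bit `flags` field. The bit positions come from the table; the helper names (`iommu_make_flags`, `iommu_flags_order`) and the shift/mask macros are assumptions made for this example and are not part of the proposed interface.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the flags table above. */
#define IOMMU_OP_readable        (1u << 0)
#define IOMMU_OP_writeable       (1u << 1)
#define IOMMU_MAP_OP_no_ref_cnt  (1u << 2)

/* Hypothetical helper macros: bits 10-15 carry the page order. */
#define IOMMU_PAGE_ORDER_SHIFT   10
#define IOMMU_PAGE_ORDER_MASK    0x3fu

/* Combine access bits with a page order (0..63) into one flags value. */
static uint16_t iommu_make_flags(uint16_t access, unsigned int order)
{
    assert(order <= IOMMU_PAGE_ORDER_MASK);
    return (uint16_t)(access | (order << IOMMU_PAGE_ORDER_SHIFT));
}

/* Extract the page order from a flags value. */
static unsigned int iommu_flags_order(uint16_t flags)
{
    return (flags >> IOMMU_PAGE_ORDER_SHIFT) & IOMMU_PAGE_ORDER_MASK;
}
```

For example, a readable and writeable 2MB mapping on x86 (order 9) would be requested with `iommu_make_flags(IOMMU_OP_readable | IOMMU_OP_writeable, 9)`.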
Defined values for map_page subop status field:
Value   Reason
------  ----------------------------------------------------------------------
0       subop successfully returned
-EIO    IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM  GFN could not be mapped because the GFN belongs to Xen.
-EPERM  Domain is not the hardware domain and GFN does not belong to domain
-EPERM  Domain is the hardware domain, IOMMU dom0-strict mode is enabled and
        GFN does not belong to domain
-EACCES BFN address conflicts with RMRR regions for devices attached to
        DOMID_SELF
-ENOSPC Page order is too large for either BFN, GFN or IOMMU unit
IOMMUOP_unmap_page
------------------
This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
are detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
`flags`        [in] Flags for signalling page order of unmap operation
`status`       [out] Mapping status of this unmap operation, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                       -----      ----------------------------------
Reserved for future use     0-9                   n/a
IOMMU_page_order            10-15     Page order to be used for bfn
Defined values for unmap_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EIO         IOMMU unit returned error when attempting to unmap BFN.
-ENOSPC      Page order is too large for either BFN address or IOMMU unit
------------------------------------------------------------------------
IOMMUOP_map_foreign_page
------------------------
This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
It is not valid to use a domid representing the calling domain.
The hypercall will only succeed if calling domain has sufficient privilege over
the specified domid.
The M2B mechanism maps an MFN to a (BFN, domid, ioserver) tuple.
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
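The M2B rule described above (a reference is only taken when a new, non-identical entry is added) could be sketched roughly as follows. This is not the intended Xen implementation: the fixed-size table, the `m2b_add` name and the `mfn_refs` counter are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define M2B_MAX 16   /* arbitrary capacity for this sketch */

struct m2b_entry {
    uint64_t mfn, bfn;
    uint16_t domid, ioserver;
    bool used;
};

struct m2b_table {
    struct m2b_entry e[M2B_MAX];
    unsigned int mfn_refs;   /* stands in for page references taken on MFNs */
};

/*
 * Returns 1 if a new entry was added (and a reference taken),
 * 0 if an identical entry already existed (no new reference),
 * -1 if the table is full.
 */
static int m2b_add(struct m2b_table *t, uint64_t mfn, uint64_t bfn,
                   uint16_t domid, uint16_t ioserver)
{
    int free_slot = -1;

    for (int i = 0; i < M2B_MAX; i++) {
        struct m2b_entry *e = &t->e[i];

        if (!e->used) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (e->mfn == mfn && e->bfn == bfn &&
            e->domid == domid && e->ioserver == ioserver)
            return 0;   /* identical entry: nothing to do */
    }
    if (free_slot < 0)
        return -1;

    t->e[free_slot] = (struct m2b_entry){ mfn, bfn, domid, ioserver, true };
    t->mfn_refs++;      /* every new M2B entry takes a reference */
    return 1;
}
```

Repeating a map with the same (MFN, BFN, domid, ioserver) therefore leaves the reference count unchanged, while the same MFN mapped for a different ioserver takes a second reference.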
All the following conditions are required to be true for PV IOMMU map_foreign
subop to succeed:
1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. The domain is the hardware_domain and the following Xen IOMMU options are
   NOT enabled: dom0-passthrough
This subop usage of the `struct pv_iommu_op` and `struct map_foreign_page`
fields are detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`domid`        [in] The domain id for which the gfn field applies
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [in] Bus address frame number for gfn address
`gfn`          [in] Guest address frame number
`flags`        [in] Details the status of the BFN mapping
`status`       [out] status of this subop, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                         Bit                Definition
----                        -----      ----------------------------------
IOMMUOP_readable              0        BFN IOMMU mapping is readable
IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
                                       swapped to scratch page
Reserved for future use      3-9       Reserved flag bits should be 0
IOMMU_page_order            10-15      Page order to be used for both gfn and bfn
Defined values for map_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EIO         IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM       Calling domain does not have sufficient privilege over domid
-EPERM       GFN could not be mapped because the GFN belongs to Xen.
-EPERM       domid maps to DOMID_SELF
-EACCES      BFN address conflicts with RMRR regions for devices attached to
             DOMID_SELF
-ENODEV      Provided ioserver id is not valid
-ENXIO       Provided domid id is not valid
-ENXIO       Provided GFN address is not valid
-ENOSPC      Page order is too large for either BFN, GFN or IOMMU unit
IOMMUOP_lookup_foreign_page
---------------------------
This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
This subop looks up a BFN mapping for an ioserver + gfn + target domid
combination.
The hypercall will only succeed if calling domain has sufficient privilege over
the specified domid.
If a 1:1 mapping exists of BFN to MFN then a M2B entry is added and a
reference is taken to the underlying MFN. If an existing mapping is present
then the BFN is returned and no additional references will be taken to the
underlying MFN.
A 1:1 mapping will exist if there is no IOMMU support or if the PV hardware
domain was booted in dom0-relaxed mode or in dom0-passthrough mode.
If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
This subop usage of the `struct pv_iommu_op` and `struct lookup_foreign_page`
fields are detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`domid`        [in] The domain id for which the gfn field applies
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [out] Bus address frame number for gfn address
`gfn`          [in] Guest address frame number
`flags`        [out] Details the status of the BFN mapping
`status`       [out] status of this subop, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                         Bit                Definition
----                        -----      ----------------------------------
IOMMUOP_readable              0        Returned BFN IOMMU mapping is readable
IOMMUOP_writeable             1        Returned BFN IOMMU mapping is writeable
Reserved for future use      2-9       Reserved flag bits should be 0
IOMMU_page_order            10-15      Returns maximum possible page order for
                                       all other IOMMUOP subops
Defined values for lookup_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EPERM       Calling domain does not have sufficient privilege over domid
-ENOENT      There is no available BFN for provided GFN + domid combination
-ENODEV      Provided ioserver id is not valid
-ENXIO       Provided domid id is not valid
-ENXIO       Provided GFN address is not valid
IOMMUOP_unmap_foreign_page
--------------------------
This subop uses `struct unmap_foreign_page` part of the `struct pv_iommu_op`.
It only allows BFNs acquired via IOMMUOP_map_foreign_page or
IOMMUOP_lookup_foreign_page to be unmapped. If an attempt is made to unmap a BFN
mapped via IOMMUOP_map_page then the subop will fail.
The subop will perform a B2M lookup (IO page table walk) for the calling domain
and then index the M2B using the returned MFN. This is safe because a particular
BFN mapping can only map to one MFN for a particular calling domain.
This subop usage of the `struct pv_iommu_op` and `struct unmap_foreign_page` fields
are detailed below:
-----------------------------------------------------------------------
Field          Purpose
-----          --------------------------------------------------------
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [in] Bus address frame number for gfn address
`flags`        [in] Flags for signalling page order of unmap operation
`status`       [out] status of this subop, 0 indicates success
-----------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                        -----     ----------------------------------
Reserved for future use     0-9                   n/a
IOMMU_page_order            10-15     Page order to be used for bfn unmapping
Defined values for unmap_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-ENOENT      An M2B entry was not found for the specified input parameters.
Linux kernel architecture
=========================
The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
space into the IOMMU. It will map the PFNs to the IOMMU address space using
a 1:1 mapping. It does this by programming a BFN to GFN mapping which matches
the PFN to GFN mapping.
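A minimal sketch of how such a batch of 1:1 mappings might be prepared is shown below. The reduced `struct pv_iommu_op` only carries the `map_page` member, the `pfn_to_gfn` array stands in for whatever PFN to GFN translation the kernel uses, and `fill_identity_map_ops` is a hypothetical name; only the subop number and the flag bits are taken from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMUOP_map_page    2
#define IOMMU_OP_readable   (1u << 0)
#define IOMMU_OP_writeable  (1u << 1)

/* Reduced copy of struct pv_iommu_op holding only the map_page member. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t  status;
    union {
        struct {
            uint64_t bfn;
            uint64_t gfn;
        } map_page;
    } u;
};

/*
 * Fill ops[] so that BFN == PFN for 'count' pages starting at first_pfn.
 * pfn_to_gfn[] is a hypothetical lookup table indexed by PFN.
 */
static void fill_identity_map_ops(struct pv_iommu_op *ops,
                                  const uint64_t *pfn_to_gfn,
                                  uint64_t first_pfn, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t pfn = first_pfn + i;

        ops[i].subop_id = IOMMUOP_map_page;
        ops[i].flags = IOMMU_OP_readable | IOMMU_OP_writeable;
        ops[i].status = 0;
        ops[i].u.map_page.bfn = pfn;              /* bus address matches PFN */
        ops[i].u.map_page.gfn = pfn_to_gfn[pfn];  /* backing guest frame */
    }
}
```

The filled array would then be handed to the hypercall in one batch, so a device sees a bus address space that is contiguous wherever the PFN space is.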
The native SWIOTLB will be used to handle devices which cannot DMA to all of
the kernel's PFN address space.
An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
subops which will allow the Linux kernel to centrally manage that domain's BFN
resource and ensure there are no unexpected conflicts.
Kernel Map Foreign GFN to BFN interface
---------------------------------------
An array of 'count' `struct pv_iommu_op` elements will be passed to the mapping
function.
    int map_foreign_gfn_to_bfn(int count, struct pv_iommu_op *ops)
The calling function will use the `struct map_foreign_page` inside the `struct
pv_iommu_op` and will fill in the domid, gfn and ioserver fields.
The kernel function will reuse the passed in struct pv_iommu_op for the
hypercall and will set the subop_id field based on the IOMMU_QUERY_map_cap
capability.
If the IOMMU_QUERY_map_cap is set then the kernel will allocate a suitable BFN
address, set the BFN field in the op to this address and set the subop_id to
IOMMUOP_map_foreign_page. It will do this on all `ops` and then issue the
hypercall.
If the IOMMU_QUERY_map_cap is NOT set then the kernel will set the subop_id
to IOMMUOP_lookup_foreign_page on all `ops` and then issue the hypercall.
The calling function should check the status field in each op and if the
status field is 0 then it can use the returned BFN address in each op.
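The subop selection described above might look roughly like the sketch below. It assumes the foreign variants of the subops (IOMMUOP_map_foreign_page and IOMMUOP_lookup_foreign_page) are the ones intended, treats `ioservid_t` as a 16 bit integer, and uses a trivial bump allocator (`alloc_bfn`) as a stand-in for real BFN address space management. The function name `prepare_foreign_map_ops` is hypothetical, and the hypercall itself is not issued here.

```c
#include <assert.h>
#include <stdint.h>

#define IOMMUOP_map_foreign_page     4
#define IOMMUOP_lookup_foreign_page  5
#define IOMMU_QUERY_map_cap          (1u << 0)

/* Reduced copy of struct pv_iommu_op; ioservid_t assumed to be uint16_t. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t  status;
    union {
        struct {
            uint64_t bfn;
            uint64_t gfn;
            uint16_t domid;
            uint16_t ioserver;
        } map_foreign_page;
        struct {
            uint64_t bfn;
            uint64_t gfn;
            uint16_t domid;
            uint16_t ioserver;
        } lookup_foreign_page;
    } u;
};

/* Hypothetical BFN allocator: hands out increasing frame numbers. */
static uint64_t next_free_bfn = 0x100000;
static uint64_t alloc_bfn(void) { return next_free_bfn++; }

/*
 * Choose the subop for every op based on the capability flags returned by
 * IOMMUOP_query_caps. The caller has already filled in domid, gfn and
 * ioserver.
 */
static void prepare_foreign_map_ops(uint16_t caps, struct pv_iommu_op *ops,
                                    int count)
{
    for (int i = 0; i < count; i++) {
        if (caps & IOMMU_QUERY_map_cap) {
            /* The domain controls its own BFN space: pick an address. */
            ops[i].subop_id = IOMMUOP_map_foreign_page;
            ops[i].u.map_foreign_page.bfn = alloc_bfn();
        } else {
            /* Xen supplies the BFN: it is returned in the bfn field. */
            ops[i].subop_id = IOMMUOP_lookup_foreign_page;
        }
    }
}
```

Because both structs share the same layout, the caller can read the resulting BFN from the same offset regardless of which subop was chosen.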
Kernel Unmap Foreign GFN to BFN interface
-----------------------------------------
An array of 'count' `struct pv_iommu_op` elements will be passed to the
unmapping function.
    int unmap_foreign_gfn_to_bfn(int count, struct pv_iommu_op *ops)
The calling function will use the `struct unmap_foreign_page` inside the `struct
pv_iommu_op` and will fill in the bfn field.
The kernel function will set the subop_id field to IOMMUOP_unmap_foreign_page
in each op and then issue the hypercall.
The calling function should check the status field in each op and if the
status field is 0 then the BFN has been successfully unmapped.
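A caller-side sketch of the unmap path and the per-op status check is given below. The helper names (`prepare_foreign_unmap_ops`, `first_failed_op`) are made up for this example, `ioservid_t` is assumed to be a 16 bit integer, and the hypercall that would fill in the status fields is left out.

```c
#include <assert.h>
#include <stdint.h>

#define IOMMUOP_unmap_foreign_page 6

/* Reduced copy of struct pv_iommu_op; ioservid_t assumed to be uint16_t. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t  status;
    union {
        struct {
            uint64_t bfn;
            uint16_t ioserver;
        } unmap_foreign_page;
    } u;
};

/* Mark every op as an unmap_foreign_page request; bfn is set by the caller. */
static void prepare_foreign_unmap_ops(struct pv_iommu_op *ops, int count)
{
    for (int i = 0; i < count; i++)
        ops[i].subop_id = IOMMUOP_unmap_foreign_page;
}

/*
 * After the hypercall returns, find the first op whose status is non-zero.
 * Returns the index of that op, or -1 if every BFN was unmapped.
 */
static int first_failed_op(const struct pv_iommu_op *ops, int count)
{
    for (int i = 0; i < count; i++)
        if (ops[i].status != 0)
            return i;
    return -1;
}
```

Checking each op individually matters because the batch as a whole can return while a single element carries an error such as -ENOENT.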
- * Re: Xen PV IOMMU interface draft C
  2015-06-26 10:23 ` Xen PV IOMMU interface draft C Malcolm Crossley
@ 2015-06-26 11:03   ` Ian Campbell
  2015-06-29 14:40     ` Konrad Rzeszutek Wilk
  2015-07-10 19:32   ` Konrad Rzeszutek Wilk
  2016-02-10 10:09   ` Xen PV IOMMU interface draft D Malcolm Crossley
  2 siblings, 1 reply; 20+ messages in thread
From: Ian Campbell @ 2015-06-26 11:03 UTC (permalink / raw)
  To: Malcolm Crossley, Stefano Stabellini, Julien Grall
  Cc: Kevin Tian, Zhang, Yu C, Andrew Cooper, Paul Durrant, Lv, Zhiyuan,
	Jan Beulich, xen-devel, David Vrabel
+ARM devs.
On Fri, 2015-06-26 at 11:23 +0100, Malcolm Crossley wrote:
> Hi All,
I had a chat with Malcolm about this with respect to ARM.
The upshot is that this does not help us to remove the dom0 1:1
workaround or associated swiotlb uses on systems without an SMMU, nor
does it allow us to sensibly do passthrough on systems which lack an
SMMU.
What it will be good for is in the future when doing "mediated
passthrough", that is the xengt like thing where the device is partly
assigned to the guest and partly emulated in a privileged domain.
I had a look through the previous draft earlier in the week and didn't
notice anything which would preclude use on ARM in the future.
Ian.
> 
> Here is a design for allowing guests to control the IOMMU. This
> allows for the guest GFN mapping to be programmed into the IOMMU and
> avoid using the SWIOTLB bounce buffer technique in the Linux kernel
> (except for legacy 32 bit DMA IO devices).
> 
> Draft C has been reordered to explain expected behaviours before the APIs
> themselves. There's an additional section to explain the rationale for
> separate subops from local GFN mappings and foreign GFN mappings.
> 
> There's also further detail on the Linux API for foreign BFN mappings.
> 
> The plan is to start writing code against this version so please provide
> feedback on any major design problems/concerns.
> 
> The pandoc markdown format of the document is provided below to allow
> for easier inline comments:
> 
> % Xen PV IOMMU interface
> % Malcolm Crossley <<malcolm.crossley@citrix.com>>
>   Paul Durrant <<paul.durrant@citrix.com>>
> % Draft C
> 
> Introduction
> ============
> 
> Revision History
> ----------------
> 
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft A  10 Apr 2014  Initial draft.
> 
> Draft B  12 Jun 2015  Second draft.
> 
> Draft C  26 Jun 2015  Third draft.
> --------------------------------------------------------------------
> 
> Background
> ==========
> 
> Linux kernel SWIOTLB
> --------------------
> 
> Xen PV guests use a Pseudophysical Frame Number(PFN) address space which is
> decoupled from the host Machine Frame Number(MFN) address space.
> 
> PV guest hardware drivers are aware of the PFN address space only and
> assume that if PFN addresses are contiguous then the hardware addresses would
> be contiguous as well. The decoupling between PFN and MFN address spaces means
> PFN and MFN addresses may not be contiguous across page boundaries and thus a
> buffer allocated in GFN address space which spans a page boundary may not be
> contiguous in MFN address space.
> 
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
> 
> A bounce buffer region is a special part of the PFN address space which has
> been made to be contiguous in both PFN and MFN address spaces. When a driver
> requests a buffer which spans a page boundary be made available for hardware
> to read, the core operating system code copies the buffer into a temporarily
> reserved part of the bounce buffer region and then returns the MFN address of
> the reserved part of the bounce buffer region back to the driver itself. The
> driver then instructs the hardware to read the copy of the buffer in the
> bounce buffer. Similarly, if the driver requests that a buffer be made available
> for hardware to write to, a region of the bounce buffer is first reserved and,
> after the hardware completes writing, the reserved region of the bounce buffer
> is copied to the originally allocated buffer.
> 
> The overhead of memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk the fixed size
> bounce buffer region will become exhausted and it will not be possible to
> return a hardware address back to the driver. The Linux kernel drivers do not
> tolerate this failure and so the kernel is forced to crash, as an
> unrecoverable error has occurred.
> 
> Input/Output Memory Management Units (IOMMU) allow for an inbound address
> mapping to be created from the I/O Bus address space (typically PCI) to
> the machine frame number address space. IOMMUs typically use a page table
> mechanism to manage the mappings and therefore can create mappings of page size
> granularity or larger.
> 
> The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
> address space for the rest of this document.
> 
> 
> Mediated Pass-through Emulators
> -------------------------------
> 
> Mediated Pass-through emulators allow guest domains to interact with
> hardware devices via emulator mediation. The emulator runs in a domain separate
> to the guest domain and it is used to enforce security of guest access to the
> hardware devices and isolation of different guests accessing the same hardware
> device.
> 
> The emulator requires a mechanism to map guest addresses to a bus address that
> the hardware devices can access.
> 
> 
> Clarification of GFN and BFN fields for different guest types
> -------------------------------------------------------------
> Guest Frame Numbers (GFN) definition varies depending on the guest type.
> 
> Diagram below details the memory accesses originating from CPU, per guest type:
> 
>       HVM guest                              PV guest
> 
>          (VA)                                   (VA)
>           |                                      |
>          MMU                                    MMU
>           |                                      |
>          (GFN)                                   |
>           |                                      | (GFN)
>      HAP a.k.a EPT/NPT                           |
>           |                                      |
>          (MFN)                                  (MFN)
>           |                                      |
>          RAM                                    RAM
> 
> For PV guests GFN is equal to MFN for a single page but not for a contiguous
> range of pages.
> 
> Bus Frame Numbers (BFN) refer to the address presented on the physical bus
> before being translated by the IOMMU.
> 
> Diagram below details memory accesses originating from physical device.
> 
>     Physical Device
>           |
>         (BFN)
>           |
>       IOMMU-PT
>           |
>         (MFN)
>           |
>          RAM
> 
> 
> 
> Purpose
> =======
> 
> 1. Allow Xen guests to create/modify/destroy IOMMU mappings for
> hardware devices that the PV guest has access to. This enables the PV guest to
> program a bus address space mapping which matches its GFN mapping. Once a 1-1
> mapping of PFN to bus address space is created then a bounce buffer
> region is not required for the I/O devices connected to the IOMMU.
> 
> 2. Allow for Xen guests to lookup/create/modify/destroy IOMMU mappings for
> guest memory of domains the calling Xen guest has sufficient privilege over.
> This enables domains to provide mediated hardware acceleration to other
> guest domains.
> 
> 
> General principles for PV IOMMU interface
> =========================================
> 
> There are two different usage models for the BFN address space of a calling
> guest based upon the two purposes specified in the section above.
> 
> A calling guest may use their BFN address space for only one of the purposes
> detailed above and so the PV IOMMU interface has a subop per usage model.
> Furthermore, the IOMMU mapping of foreign domain memory is more complex than
> IOMMU mapping of local domain memory and separating the subops allows for the
> complexity to be split in the implementation.
> 
> The PV IOMMU design allows the calling domain to control its BFN memory map.
> Thus the design also assigns it the responsibility of ensuring that a BFN
> address mapped for local domain memory mappings is not reused for foreign
> domain memory mappings without an explicit unmap of the BFN address first. This
> simplifies the usage of the API and the extra overhead for the calling domains
> should be minimal as they should already be tracking the BFN address space usage.
> 
> 
> Emulator usage of PV IOMMU interface
> ====================================
> 
> Emulators which require bus address mapping of guest RAM must first determine if
> it's possible for the domain to control the bus addresses themselves.
> 
> An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
> flag is set then the emulator may specify the BFN address it wishes guest RAM to
> be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
> then the emulator must use BFN addresses supplied by Xen via the
> IOMMUOP_lookup_foreign_page subop.
> 
> Operating systems which use the IOMMUOP_map_page subop are expected to provide a
> common interface for emulators to use. Otherwise emulators will not be aware
> of existing BFN mappings created by the operating system and will get failed
> subops due to conflicts in the BFN address space for the domain.
> 
> Emulators should unmap unused GFN mappings as often as possible using
> IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
> quickly and efficiently.
> 
> Emulators should conform to the ballooning behaviour described in section
> "IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
> domains are able to effectively balloon out and in memory.
> 
> Emulators must unmap any active BFN mappings when they shut down.
> 
> IOMMUOP_*_foreign_page interactions with guest domain ballooning
> ================================================================
> 
> Guest domains can balloon out a set of GFN mappings at any time and render the
> BFN to GFN mapping invalid.
> 
> When a BFN to GFN mapping becomes invalid, Xen will issue a buffered I/O request
> of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
> BFN address in the data field. If the buffered I/O request ring is full then a
> standard (synchronous) I/O request of type IOREQ_TYPE_INVALIDATE will be issued
> to the affected IOREQ server with the just invalidated BFN address in the data
> field.
> 
> The BFN mappings cannot be simply unmapped at the point of the balloon hypercall,
> otherwise a malicious guest could specifically balloon out a GFN address
> in use by an emulator and trigger IOMMU faults for the domains with BFN
> mappings.
> 
> For hosts with no IOMMU support: The affected emulator(s) must specifically
> issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
> the references to the underlying MFN are removed and the MFN can be freed back
> to the Xen memory allocator.
> 
> For hosts with IOMMU support:
> If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page subop then the affected emulator(s) must
> specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN
> address so that the references to the underlying MFN are removed.
> 
> If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
> the BFN mapping will be swapped to point at a scratch MFN page and all BFN
> references to the invalid MFN will be removed by Xen after the BFN mapping has
> been updated to point at the scratch MFN page.
> 
> The rationale for swapping the BFN mapping to point at scratch pages is to
> enable guest domains to balloon quickly without requiring hypercall(s) from
> emulators.
> 
> Not all BFN mappings can be swapped without potentially causing problems for the
> hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
> allow per BFN control of Xen ballooning behaviour.
> 
> 
> PV IOMMU interactions with self ballooning
> ==========================================
> 
> A guest should clear any IOMMU mappings it has of its own pages before
> releasing a page back to Xen. The guest also will need to add IOMMU mappings
> after repopulating a page with the populate_physmap hypercall.
> 
> PV guests must clear any IOMMU mappings before pinning page table pages
> because the IOMMU mappings will take a writable reference count and this will
> prevent page table pinning.
> 
> 
> Security Implications of allowing domain IOMMU control
> ======================================================
> 
> Xen currently allows I/O devices attached to hardware domain to have direct
> access to all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
> 
> The PV IOMMU feature provides the same level of access to MFN address space
> and the feature is not enabled when the Xen IOMMU option dom0-strict is
> enabled. Therefore security is not degraded by the PV IOMMU feature.
> 
> Domains with physical device(s) assigned which are not hardware domains are only
> allowed to map their own GFNs or GFNs for domain(s) they have privilege over.
> 
> 
> PV IOMMU interactions with grant map/unmap operations
> =====================================================
> 
> Grant map operations return a Physical device accessible address (BFN) if the
> GNTMAP_device_map flag is set.  This operation currently returns the MFN for PV
> guests which may conflict with the BFN address space the guest uses if PV IOMMU
> map support is available to the guest.
> 
> This design proposes to allow the calling domain to control the BFN address that
> a grant map operation uses.
> 
> This can be achieved by specifying that the dev_bus_addr in the
> gnttab_map_grant_ref structure is used as an input parameter instead of the
> output parameter it currently is.
> 
> Only PAGE_SIZE aligned addresses are allowed for dev_bus_addr input parameter.
> 
> The revised structure is shown below for convenience.
> 
>     struct gnttab_map_grant_ref {
>         /* IN parameters. */
>         uint64_t host_addr;
>         uint32_t flags;               /* GNTMAP_* */
>         grant_ref_t ref;
>         domid_t  dom;
>         /* OUT parameters. */
>         int16_t  status;              /* => enum grant_status */
>         grant_handle_t handle;
>         /* IN/OUT parameters */
>         uint64_t dev_bus_addr;
>     };
> 
> 
> The grant map operation would then behave similarly to the IOMMUOP_map_page
> subop for the creation of the IOMMU mapping.
> 
> The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
> subop for the removal of the IOMMU mapping.
> 
> A new grantmap flag would be used to indicate the domain is requesting the
> dev_bus_addr field be used as an input parameter.
> 
> 
>     #define _GNTMAP_request_bfn_map      (6)
>     #define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)
> 
> 
> Xen PV-IOMMU Architecture
> =========================
> 
> The Xen architecture consists of a new hypercall interface and changes to the
> grant map interface.
> 
> The existing IOMMU mappings setup at domain creation time will be preserved so
> that PV domains unaware of this feature will continue to function with no
> changes required.
> 
> Memory ballooning will be supported by taking an additional reference on the
> MFN backing the GFN for each successful IOMMU mapping created.
> 
> An M2B tracking structure will be used to ensure all references to an MFN can
> be located efficiently.
> 
> Xen PV IOMMU hypercall interface
> --------------------------------
> A two argument hypercall interface (do_iommu_op).
> 
>     ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
> 
> First argument, guest handle pointer to array of `struct pv_iommu_op`
> 
> Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
> 
> Definition of `struct pv_iommu_op`:
> 
>     struct pv_iommu_op {
> 
>         uint16_t subop_id;
>         uint16_t flags;
>         int32_t status;
> 
>         union {
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>             } map_page;
> 
>             struct {
>                 uint64_t bfn;
>             } unmap_page;
> 
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>                 uint16_t domid;
>                 ioservid_t ioserver;
>             } map_foreign_page;
> 
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>                 uint16_t domid;
>                 ioservid_t ioserver;
>             } lookup_foreign_page;
> 
>             struct {
>                 uint64_t bfn;
>                 ioservid_t ioserver;
>             } unmap_foreign_page;
>         } u;
>     };
> 
> Definition of PV IOMMU subops:
> 
>     #define IOMMUOP_query_caps            1
>     #define IOMMUOP_map_page              2
>     #define IOMMUOP_unmap_page            3
>     #define IOMMUOP_map_foreign_page      4
>     #define IOMMUOP_lookup_foreign_page   5
>     #define IOMMUOP_unmap_foreign_page    6
> 
> 
> Design considerations for hypercall op
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the IOMMU TLB
> to ensure the I/O device uses the updated mappings.
> 
> The op has been designed to take an array of operations and a count as
> parameters. This allows for easily implemented hypercall continuations to be
> used and allows for batches of IOMMU operations to be submitted before flushing
> the IOMMU TLB.
> 
> The `subop_id` to be used for a particular element is encoded into the element
> itself. This allows for map and unmap operations to be performed in one hypercall
> and for the IOMMU TLB flushing optimisations to be still applied.
> 
> The hypercall will ensure that the required IOMMU TLB flushes are applied before
> returning to guest via either hypercall completion or a hypercall continuation.
> 
> IOMMUOP_query_caps
> ------------------
> 
> This subop queries the runtime capabilities of the PV-IOMMU interface for the
> specific calling domain. This subop uses `struct pv_iommu_op` directly.
> 
> ------------------------------------------------------------------------------
> Field          Purpose
> -----          ---------------------------------------------------------------
> `flags`        [out] This field details the IOMMUOP capabilities.
> 
> `status`       [out] Status of this op, op specific values listed below
> ------------------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> ------------------------------------------------------------------------------
> Name                        Bit                Definition
> ----                       ------     ----------------------------------
> IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or IOMMUOP_map_foreign
>                                       can be used for the calling domain
> 
> IOMMU_QUERY_map_all_mfns     1        IOMMUOP_map_page subop can map any MFN
>                                       not used by Xen
> 
> Reserved for future use     2-9                   n/a
> 
> IOMMU_page_order           10-15      Returns maximum possible page order for
>                                       all other IOMMUOP subops
> ------------------------------------------------------------------------------
> 
> Defined values for query_caps subop status field:
> 
> Value   Reason
> ------  ----------------------------------------------------------
> 0       subop successfully returned
> 
> IOMMUOP_map_page
> ----------------------
> This subop uses `struct map_page` part of the `struct pv_iommu_op`.
> 
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFNs except for Xen owned MFNs else the hardware
> domain will only be allowed to map GFNs which it owns.
> 
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFNs without taking a reference to the MFN backing the GFN
> by setting the IOMMU_MAP_OP_no_ref_cnt flag.
> 
> Every successful pv_iommu_op will result in an additional page reference being
> taken on the MFN backing the GFN except for the condition detailed above.
> 
> If the map_op flags indicate a writeable mapping is required then a writeable
> page type reference will be taken otherwise a standard page reference will be
> taken.
> 
> All the following conditions are required to be true for PV IOMMU map
> subop to succeed:
> 
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. If hardware_domain and the following Xen IOMMU options are
>    NOT enabled: dom0-passthrough
> 
> This subop usage of the `struct pv_iommu_op` and `struct map_page` fields
> are detailed below:
> 
> ------------------------------------------------------------------------------
> Field          Purpose
> -----          ---------------------------------------------------------------
> `bfn`          [in]  Bus address frame number(BFN) to be mapped to specified gfn
>                      below
> 
> `gfn`          [in]  Guest address frame number for DOMID_SELF
> 
> `flags`        [in]  Flags for signalling type of IOMMU mapping to be created,
>                      Flags can be combined.
> 
> `status`       [out] Mapping status of this op, op specific values listed below
> ------------------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                        Bit                Definition
> ----                       -----      ----------------------------------
> IOMMU_OP_readable            0        Create readable IOMMU mapping
> IOMMU_OP_writeable           1        Create writeable IOMMU mapping
> IOMMU_MAP_OP_no_ref_cnt      2        IOMMU mapping does not take a reference to
>                                       MFN backing BFN mapping
> Reserved for future use     3-9                   n/a
> IOMMU_page_order            10-15     Page order to be used for both gfn and bfn
> 
> Defined values for map_page subop status field:
> 
> Value   Reason
> ------  ----------------------------------------------------------------------
> 0       subop successfully returned
> -EIO    IOMMU unit returned error when attempting to map BFN to GFN.
> -EPERM  GFN could not be mapped because the GFN belongs to Xen.
> -EPERM  Domain is not the hardware domain and GFN does not belong to domain
> -EPERM  Domain is the hardware domain, IOMMU dom0-strict mode is enabled and
>         GFN does not belong to domain
> -EACCES BFN address conflicts with RMRR regions for devices attached to
>         DOMID_SELF
> -ENOSPC Page order is too large for either BFN, GFN or IOMMU unit
> 
> IOMMUOP_unmap_page
> ------------------
> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
> 
> The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
> are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
> 
> `flags`        [in] Flags for signalling page order of unmap operation
> 
> `status`       [out] Mapping status of this unmap operation, 0 indicates success
> --------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                        Bit                Definition
> ----                       -----      ----------------------------------
> Reserved for future use     0-9                   n/a
> IOMMU_page_order            10-15     Page order to be used for bfn
> 
> 
> Defined values for unmap_page subop status field:
> 
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EIO         IOMMU unit returned error when attempting to unmap BFN.
> -ENOSPC      Page order is too large for either BFN address or IOMMU unit
> ------------------------------------------------------------------------
> 
> 
> IOMMUOP_map_foreign_page
> ------------------------
> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
> 
> It is not valid to use a domid representing the calling domain.
> 
> The hypercall will only succeed if calling domain has sufficient privilege over
> the specified domid.
> 
> The M2B mechanism is an MFN to (BFN,domid,ioserver) tuple.
> 
> Each successful subop will add to the M2B if there was not an existing identical
> M2B entry.
> 
> Every new M2B entry will take a reference to the MFN backing the GFN.
> 
> All the following conditions are required to be true for PV IOMMU map_foreign
> subop to succeed:
> 
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. The domain is the hardware_domain and the following Xen IOMMU options are
>    NOT enabled: dom0-passthrough
> 
> 
> This subop usage of the `struct pv_iommu_op` and `struct map_foreign_page`
> fields are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `domid`        [in] The domain id for which the gfn field applies
> 
> `ioserver`     [in] IOREQ server id associated with mapping
> 
> `bfn`          [in] Bus address frame number for gfn address
> 
> `gfn`          [in] Guest address frame number
> 
> `flags`        [in] Details the status of the BFN mapping
> 
> `status`       [out] status of this subop, 0 indicates success
> --------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                         Bit                Definition
> ----                        -----      ----------------------------------
> IOMMUOP_readable              0        BFN IOMMU mapping is readable
> IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
> IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
>                                        swapped to scratch page
> Reserved for future use      3-9       Reserved flag bits should be 0
> IOMMU_page_order            10-15      Page order to be used for both gfn and bfn
> 
> Defined values for map_foreign_page subop status field:
> 
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EIO         IOMMU unit returned error when attempting to map BFN to GFN.
> -EPERM       Calling domain does not have sufficient privilege over domid
> -EPERM       GFN could not be mapped because the GFN belongs to Xen.
> -EPERM       domid maps to DOMID_SELF
> -EACCES      BFN address conflicts with RMRR regions for devices attached to
>              DOMID_SELF
> -ENODEV      Provided ioserver id is not valid
> -ENXIO       Provided domid id is not valid
> -ENXIO       Provided GFN address is not valid
> -ENOSPC      Page order is too large for either BFN, GFN or IOMMU unit
> 
> IOMMUOP_lookup_foreign_page
> ---------------------------
> This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
> 
> This subop looks up a BFN mapping for an ioserver + gfn + target domid
> combination.
> 
> The hypercall will only succeed if calling domain has sufficient privilege over
> the specified domid.
> 
> If a 1:1 mapping exists of BFN to MFN then a M2B entry is added and a
> reference is taken to the underlying MFN. If an existing mapping is present
> then the BFN is returned and no additional references will be taken to the
> underlying MFN.
> 
> A 1:1 mapping will exist if there is no IOMMU support or if the PV hardware
> domain was booted in dom0-relaxed mode or in dom0-passthrough mode.
> 
> If there is no IOMMU support then the MFN is returned in the BFN field (that is
> the only valid bus address for the GFN + domid combination).
> 
> Each successful subop will add to the M2B if there was not an existing identical
> M2B entry.
> 
> Every new M2B entry will take a reference to the MFN backing the GFN.
> 
> This subop usage of the `struct pv_iommu_op` and `struct lookup_foreign_page`
> fields are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `domid`        [in] The domain id for which the gfn field applies
> 
> `ioserver`     [in] IOREQ server id associated with mapping
> 
> `bfn`          [out] Bus address frame number for gfn address
> 
> `gfn`          [in] Guest address frame number
> 
> `flags`        [out] Details the status of the BFN mapping
> 
> `status`       [out] status of this subop, 0 indicates success
> --------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                         Bit                Definition
> ----                        -----      ----------------------------------
> IOMMUOP_readable              0        Returned BFN IOMMU mapping is readable
> IOMMUOP_writeable             1        Returned BFN IOMMU mapping is writeable
> Reserved for future use      2-9       Reserved flag bits should be 0
> IOMMU_page_order            10-15      Returns maximum possible page order for
>                                        all other IOMMUOP subops
> 
> Defined values for lookup_foreign_page subop status field:
> 
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -EPERM       Calling domain does not have sufficient privilege over domid
> -ENOENT      There is no available BFN for provided GFN + domid combination
> -ENODEV      Provided ioserver id is not valid
> -ENXIO       Provided domid id is not valid
> -ENXIO       Provided GFN address is not valid
> 
> 
> IOMMUOP_unmap_foreign_page
> --------------------------
> This subop uses `struct unmap_foreign_page` part of the `struct pv_iommu_op`.
> 
> It only allows BFNs acquired via IOMMUOP_map_foreign_page or IOMMUOP_lookup_page
> to be unmapped. If an attempt is made to unmap a BFN mapped via IOMMUOP_map_page
> then the subop will fail.
> 
> The subop will perform a B2M lookup (IO page table walk) for the calling domain
> and then index the M2B using the returned MFN. This is safe because a particular
> BFN mapping can only map to one MFN for a particular calling domain.
> 
> This subop usage of the `struct pv_iommu_op` and `struct unmap_foreign_page` fields
> are detailed below:
> 
> -----------------------------------------------------------------------
> Field          Purpose
> -----          --------------------------------------------------------
> `ioserver`     [in] IOREQ server id associated with mapping
> 
> `bfn`          [in] Bus address frame number for gfn address
> 
> `flags`        [in] Flags for signalling page order of unmap operation
> 
> `status`       [out] status of this subop, 0 indicates success
> -----------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                        Bit                Definition
> ----                        -----     ----------------------------------
> Reserved for future use     0-9                   n/a
> IOMMU_page_order            10-15     Page order to be used for bfn unmapping
> 
> Defined values for unmap_foreign_page subop status field:
> 
> Error code  Reason
> ----------  ------------------------------------------------------------
> 0            subop successfully returned
> -ENOENT      An M2B entry was not found for the specified input parameters.
> 
> 
> Linux kernel architecture
> =========================
> 
> The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
> space into the IOMMU. It will map the PFNs to the IOMMU address space using
> a 1:1 mapping, it does this by programming a BFN to GFN mapping which matches
> the PFN to GFN mapping.
> 
> The native SWIOTLB will be used to handle devices which cannot DMA to all of
> the kernel's PFN address space.
> 
> An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
> subops which will allow the Linux kernel to centrally manage that domain's BFN
> resource and ensure there are no unexpected conflicts.
> 
> Kernel Map Foreign GFN to BFN interface
> ---------------------------------------
> 
> An array of 'count' of 'struct pv_iommu_ops' will be passed to the mapping
> function.
> 
>     int map_foreign_gfn_to_bfn(int count, struct pv_iommu_op *ops)
> 
> The calling function will use the `struct map_foreign_page` inside the `struct
> pv_iommu_op` and will fill in the domid, gfn and ioserver_id fields.
> 
> The kernel function will reuse the passed in struct pv_iommu_op for the
> hypercall and will set the subop_id field based on the IOMMU_QUERY_map_cap
> capability.
> 
> If the IOMMU_QUERY_map_cap is set then the kernel will allocate a suitable BFN
> address, set the BFN field in the op to this address and set the subop_id to
> IOMMUOP_map_page. It will do this on all 'ops' and then issue the hypercall.
> 
> If the IOMMU_QUERY_map_cap is NOT set then the kernel will set the subops_id
> to IOMMUOP_lookup_page on all `ops` and then issue the hypercall.
> 
> The calling function should check the status field in each op and if the
> status field is 0 then it can use the returned BFN address in each op.
> 
> 
> Kernel Unmap Foreign GFN to BFN interface
> -----------------------------------------
> 
> An array of 'count' of 'struct pv_iommu_ops' will be passed to the unmapping
> function.
> 
>     int unmap_foreign_gfn_to_bfn(int count, struct pv_iommu_op *ops)
> 
> The calling function will use the `struct unmap_foreign_page` inside the `struct
> pv_iommu_op` and will fill in the bfn field.
> 
> The kernel function will set the subop_id field to IOMMUOP_unmap_foreign_page
> in each op and then issue the hypercall.
> 
> The calling function should check the status field in each op and if the
> status field is 0 then the BFN has been successfully unmapped.
> 
> 
- * Re: Xen PV IOMMU interface draft C
  2015-06-26 11:03   ` Ian Campbell
@ 2015-06-29 14:40     ` Konrad Rzeszutek Wilk
  2015-06-29 14:52       ` Ian Campbell
  0 siblings, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-29 14:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, Zhang, Yu C, Stefano Stabellini, Andrew Cooper,
	Julien Grall, Paul Durrant, Lv, Zhiyuan, Jan Beulich, xen-devel,
	Malcolm Crossley, David Vrabel
On Fri, Jun 26, 2015 at 12:03:44PM +0100, Ian Campbell wrote:
> +ARM devs.
> 
> On Fri, 2015-06-26 at 11:23 +0100, Malcolm Crossley wrote:
> > Hi All,
> 
> I had a chat with Malcolm about this with respect to ARM.
> 
> The upshot is that this does not help us to remove the dom0 1:1
> workaround or associated swiotlb uses on systems without an SMMU, nor
> does it allow us to sensibly do passthrough on systems which lack an
> SMMU.
What would?
- * Re: Xen PV IOMMU interface draft C
  2015-06-29 14:40     ` Konrad Rzeszutek Wilk
@ 2015-06-29 14:52       ` Ian Campbell
  2015-06-29 15:05         ` Malcolm Crossley
  2015-06-29 15:24         ` David Vrabel
  0 siblings, 2 replies; 20+ messages in thread
From: Ian Campbell @ 2015-06-29 14:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kevin Tian, Zhang, Yu C, Stefano Stabellini, Andrew Cooper,
	Julien Grall, Paul Durrant, Lv, Zhiyuan, Jan Beulich, xen-devel,
	Malcolm Crossley, David Vrabel
On Mon, 2015-06-29 at 10:40 -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 26, 2015 at 12:03:44PM +0100, Ian Campbell wrote:
> > +ARM devs.
> > 
> > On Fri, 2015-06-26 at 11:23 +0100, Malcolm Crossley wrote:
> > > Hi All,
> > 
> > I had a chat with Malcolm about this with respect to ARM.
> > 
> > The upshot is that this does not help us to remove the dom0 1:1
> > workaround or associated swiotlb uses on systems without an SMMU, nor
> > does it allow us to sensibly do passthrough on systems which lack an
> > SMMU.
> 
> What would?
This "Xen PV IOMMU interface".
- * Re: Xen PV IOMMU interface draft C
  2015-06-29 14:52       ` Ian Campbell
@ 2015-06-29 15:05         ` Malcolm Crossley
  2015-06-29 15:24         ` David Vrabel
  1 sibling, 0 replies; 20+ messages in thread
From: Malcolm Crossley @ 2015-06-29 15:05 UTC (permalink / raw)
  To: Ian Campbell, Konrad Rzeszutek Wilk
  Cc: Kevin Tian, Zhang, Yu C, Stefano Stabellini, Andrew Cooper,
	Julien Grall, Paul Durrant, Lv, Zhiyuan, Jan Beulich, xen-devel,
	David Vrabel
On 29/06/15 15:52, Ian Campbell wrote:
> On Mon, 2015-06-29 at 10:40 -0400, Konrad Rzeszutek Wilk wrote:
>> On Fri, Jun 26, 2015 at 12:03:44PM +0100, Ian Campbell wrote:
>>> +ARM devs.
>>>
>>> On Fri, 2015-06-26 at 11:23 +0100, Malcolm Crossley wrote:
>>>> Hi All,
>>>
>>> I had a chat with Malcolm about this with respect to ARM.
>>>
>>> The upshot is that this does not help us to remove the dom0 1:1
>>> workaround or associated swiotlb uses on systems without an SMMU, nor
>>> does it allow us to sensibly do passthrough on systems which lack an
>>> SMMU.
>>
>> What would?
> 
> This "Xen PV IOMMU interface".
> 
> 
You only get "mediated device passthrough" on systems without IOMMUs.
- * Re: Xen PV IOMMU interface draft C
  2015-06-29 14:52       ` Ian Campbell
  2015-06-29 15:05         ` Malcolm Crossley
@ 2015-06-29 15:24         ` David Vrabel
  2015-06-29 15:36           ` Ian Campbell
  1 sibling, 1 reply; 20+ messages in thread
From: David Vrabel @ 2015-06-29 15:24 UTC (permalink / raw)
  To: Ian Campbell, Konrad Rzeszutek Wilk
  Cc: Kevin Tian, Zhang, Yu C, Stefano Stabellini, Andrew Cooper,
	Julien Grall, Paul Durrant, Lv, Zhiyuan, Jan Beulich, xen-devel,
	Malcolm Crossley
On 29/06/15 15:52, Ian Campbell wrote:
> On Mon, 2015-06-29 at 10:40 -0400, Konrad Rzeszutek Wilk wrote:
>> On Fri, Jun 26, 2015 at 12:03:44PM +0100, Ian Campbell wrote:
>>> +ARM devs.
>>>
>>> On Fri, 2015-06-26 at 11:23 +0100, Malcolm Crossley wrote:
>>>> Hi All,
>>>
>>> I had a chat with Malcolm about this with respect to ARM.
>>>
>>> The upshot is that this does not help us to remove the dom0 1:1
>>> workaround or associated swiotlb uses on systems without an SMMU, nor
>>> does it allow us to sensibly do passthrough on systems which lack an
>>> SMMU.
>>
>> What would?
> 
> This "Xen PV IOMMU interface".
I guess Konrad is asking if this PV IOMMU interface doesn't solve these
problems, what (other possible interface) would?  I guess the answer is
a SMMU?
David
- * Re: Xen PV IOMMU interface draft C
  2015-06-29 15:24         ` David Vrabel
@ 2015-06-29 15:36           ` Ian Campbell
  0 siblings, 0 replies; 20+ messages in thread
From: Ian Campbell @ 2015-06-29 15:36 UTC (permalink / raw)
  To: David Vrabel
  Cc: Kevin Tian, Zhang, Yu C, Stefano Stabellini, Andrew Cooper,
	Julien Grall, Paul Durrant, Lv, Zhiyuan, Jan Beulich, xen-devel,
	Malcolm Crossley
On Mon, 2015-06-29 at 16:24 +0100, David Vrabel wrote:
> On 29/06/15 15:52, Ian Campbell wrote:
> > On Mon, 2015-06-29 at 10:40 -0400, Konrad Rzeszutek Wilk wrote:
> >> On Fri, Jun 26, 2015 at 12:03:44PM +0100, Ian Campbell wrote:
> >>> +ARM devs.
> >>>
> >>> On Fri, 2015-06-26 at 11:23 +0100, Malcolm Crossley wrote:
> >>>> Hi All,
> >>>
> >>> I had a chat with Malcolm about this with respect to ARM.
> >>>
> >>> The upshot is that this does not help us to remove the dom0 1:1
> >>> workaround or associated swiotlb uses on systems without an SMMU, nor
> >>> does it allow us to sensibly do passthrough on systems which lack an
> >>> SMMU.
> >>
> >> What would?
> > 
> > This "Xen PV IOMMU interface".
> 
> I guess Konrad is asking if this PV IOMMU interface doesn't solve these
> problems, what (other possible interface) would?
Ah, I thought it was a response to the "this" in "... that this does
not...".
> I guess the answer is a SMMU?
That's the only solution I've thought of, yes.
Ian.
- * Re: Xen PV IOMMU interface draft C
  2015-06-26 10:23 ` Xen PV IOMMU interface draft C Malcolm Crossley
  2015-06-26 11:03   ` Ian Campbell
@ 2015-07-10 19:32   ` Konrad Rzeszutek Wilk
  2016-02-10 10:09   ` Xen PV IOMMU interface draft D Malcolm Crossley
  2 siblings, 0 replies; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-07-10 19:32 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Kevin Tian, Ian Campbell, Zhang, Yu C, Andrew Cooper,
	Paul Durrant, Lv, Zhiyuan, Jan Beulich, xen-devel, David Vrabel
> Xen PV IOMMU hypercall interface
> --------------------------------
> A two argument hypercall interface (do_iommu_op).
> 
>     ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
> 
> First argument, guest handle pointer to array of `struct pv_iommu_op`
> 
> Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
Is there an upper limit on it? If there is and we hit it, what is the return
value? Should it be E2BIG, so that the toolstack (or kernel) adjusts the size
accordingly?
.. snip..
> Linux kernel architecture
> =========================
> 
> The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
> space into the IOMMU. It will map the PFNs to the IOMMU address space using
> a 1:1 mapping, it does this by programming a BFN to GFN mapping which matches
> the PFN to GFN mapping.
> 
> The native SWIOTLB will be used to handle devices which cannot DMA to all of
> the kernel's PFN address space.
native? Or the Xen SWIOTLB, which consults the P2M space? In which case it is
the Xen-SWIOTLB.
- * Xen PV IOMMU interface draft D
  2015-06-26 10:23 ` Xen PV IOMMU interface draft C Malcolm Crossley
  2015-06-26 11:03   ` Ian Campbell
  2015-07-10 19:32   ` Konrad Rzeszutek Wilk
@ 2016-02-10 10:09   ` Malcolm Crossley
  2016-02-18  8:21     ` Tian, Kevin
                       ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Malcolm Crossley @ 2016-02-10 10:09 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, Konrad Rzeszutek Wilk, Andrew Cooper,
	Paul Durrant, Kevin Tian, Lv, Zhiyuan, Zhang, Yu C, David Vrabel,
	Ian Campbell
% Xen PV IOMMU interface
% Malcolm Crossley <<malcolm.crossley@citrix.com>>
  Paul Durrant <<paul.durrant@citrix.com>>
% Draft D
Introduction
============
Revision History
----------------
--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft A  10 Apr 2014  Initial draft.
Draft B  12 Jun 2015  Second draft.
Draft C  26 Jun 2015  Third draft.
Draft D  09 Feb 2016  Fourth draft.
--------------------------------------------------------------------
Background
==========
Linux kernel SWIOTLB
--------------------
Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is
decoupled from the host Machine Frame Number (MFN) address space.
PV guest hardware drivers are aware of the PFN address space only and
assume that if PFN addresses are contiguous then the hardware addresses would
be contiguous as well. The decoupling between PFN and MFN address spaces means
PFN and MFN addresses may not be contiguous across page boundaries and thus a
buffer allocated in GFN address space which spans a page boundary may not be
contiguous in MFN address space.
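
The contiguity problem can be illustrated with a small sketch. The toy table
`p2m` and the function name below are hypothetical and exist only for
illustration; they are not part of any Xen or Linux interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical toy P2M: p2m[pfn] holds the MFN backing that PFN.
 * Returns true if the nr_pages PFNs starting at first_pfn are backed by
 * consecutive MFNs, i.e. the buffer is also contiguous in MFN space.
 */
bool range_is_mfn_contiguous(const uint64_t *p2m, size_t first_pfn,
                             size_t nr_pages)
{
    assert(nr_pages > 0);
    for (size_t i = 1; i < nr_pages; i++) {
        if (p2m[first_pfn + i] != p2m[first_pfn] + i)
            return false;
    }
    return true;
}
```

A single page is always contiguous; a buffer crossing a page boundary is only
usable for DMA without bouncing if this check would succeed.
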
PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.
A bounce buffer region is a special part of the PFN address space which has
been made to be contiguous in both PFN and MFN address spaces. When a driver
requests that a buffer which spans a page boundary be made available for
hardware to read, the core operating system code copies the buffer into a
temporarily reserved part of the bounce buffer region and then returns the MFN
address of the reserved part of the bounce buffer region back to the driver
itself. The driver then instructs the hardware to read the copy of the buffer
in the bounce buffer. Similarly, if the driver requests that a buffer be made
available for hardware to write to, a region of the bounce buffer is reserved
first, and after the hardware completes writing the reserved region of the
bounce buffer is copied to the originally allocated buffer.
The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk the fixed size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do not
tolerate this failure and so the kernel is forced to crash, as an
unrecoverable error has occurred.
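
A minimal sketch of the bounce buffer behaviour described above, assuming a
toy fixed-size pool. All names and sizes here are hypothetical; the real
SWIOTLB code is considerably more involved. The sketch only shows the two copy
directions and the exhaustion case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SLOTS     4     /* fixed-size pool, so it can run out */
#define BOUNCE_SLOT_SIZE 8192  /* each slot stands in for a machine-contiguous region */

struct bounce_pool {
    unsigned char data[BOUNCE_SLOTS][BOUNCE_SLOT_SIZE];
    int in_use[BOUNCE_SLOTS];
};

/* Reserve a slot. Returns its index, or -1 when the request is too large
 * or the pool is exhausted (the failure drivers do not tolerate). */
int bounce_reserve(struct bounce_pool *pool, size_t len)
{
    if (len > BOUNCE_SLOT_SIZE)
        return -1;
    for (int i = 0; i < BOUNCE_SLOTS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Hardware-read case: copy the driver's buffer into a reserved slot. */
int bounce_map_for_device_read(struct bounce_pool *pool, const void *buf,
                               size_t len)
{
    int slot = bounce_reserve(pool, len);
    if (slot >= 0)
        memcpy(pool->data[slot], buf, len);
    return slot;
}

/* Hardware-write case: once the device has written into the slot, copy the
 * data back to the originally allocated buffer and release the slot. */
void bounce_complete_device_write(struct bounce_pool *pool, int slot,
                                  void *buf, size_t len)
{
    assert(slot >= 0 && slot < BOUNCE_SLOTS && len <= BOUNCE_SLOT_SIZE);
    memcpy(buf, pool->data[slot], len);
    pool->in_use[slot] = 0;
}
```

Every transfer costs one extra memcpy, and a fifth concurrent reservation in
this toy pool fails, which mirrors the two drawbacks listed above.
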
Input/Output Memory Management Units (IOMMU) allow for an inbound address
mapping to be created from the I/O Bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and therefore can create mappings of page size
granularity or larger.
The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
address space for the rest of this document.
Mediated Pass-through Emulators
-------------------------------
Mediated Pass-through emulators allow guest domains to interact with
hardware devices via emulator mediation. The emulator runs in a domain separate
to the guest domain and it is used to enforce security of guest access to the
hardware devices and isolation of different guests accessing the same hardware
device.
The emulator requires a mechanism to map guest addresses to a bus address that
the hardware devices can access.
Clarification of GFN and BFN fields for different guest types
-------------------------------------------------------------
Guest Frame Numbers (GFN) definition varies depending on the guest type.
Diagram below details the memory accesses originating from CPU, per guest type:
      HVM guest                              PV guest
         (VA)                                   (VA)
          |                                      |
         MMU                                    MMU
          |                                      |
         (GFN)                                   |
          |                                      | (GFN)
     HAP a.k.a EPT/NPT                           |
          |                                      |
         (MFN)                                  (MFN)
          |                                      |
         RAM                                    RAM
For PV guests GFN is equal to MFN for a single page but not for a contiguous
range of pages.
Bus Frame Numbers (BFN) refer to the address presented on the physical bus
before being translated by the IOMMU.
Diagram below details memory accesses originating from physical device.
    Physical Device
          |
        (BFN)
          |
       IOMMU-PT
          |
        (MFN)
          |
         RAM
Purpose
=======
1. Allow Xen guests to create/modify/destroy IOMMU mappings for
hardware devices that the PV guest has access to. This enables the PV guest to
program a bus address space mapping which matches its GFN mapping. Once a 1-1
mapping of PFN to bus address space is created then a bounce buffer
region is not required for the I/O devices connected to the IOMMU.
2. Allow for Xen guests to lookup/create/modify/destroy IOMMU mappings for
guest memory of domains the calling Xen guest has sufficient privilege over.
This enables domains to provide mediated hardware acceleration to other
guest domains.
General principles for PV IOMMU interface
=========================================
There are two different usage models for the BFN address space of a calling
guest based upon the two purposes specified in the section above.
A calling guest may use their BFN address space for only one of the purposes
detailed above and so the PV IOMMU interface has a subop per usage model.
Furthermore, the IOMMU mapping of foreign domain memory is more complex than
the IOMMU mapping of local domain memory, and separating the subops allows for
the complexity to be split in the implementation.
The PV IOMMU design allows the calling domain to control its BFN memory map.
Thus the design also assigns the calling domain the responsibility of ensuring
that a BFN address mapped for local domain memory is not reused for a foreign
domain memory mapping without an explicit unmap of the BFN address first. This
simplifies the usage of the API and the extra overhead for the calling domains
should be minimal as they should already be tracking the BFN address space usage.
Emulator usage of PV IOMMU interface
====================================
Emulators which require bus address mapping of guest RAM must first determine if
it's possible for the domain to control the bus addresses themselves.
An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
flag is set then the emulator may specify the BFN address it wishes guest RAM to
be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
then the emulator must use BFN addresses supplied by Xen via the
IOMMUOP_lookup_foreign_page subop.
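
The capability check above could be sketched as follows. The bit positions
(map_cap at bit 0 of the query_caps flags; readable, writeable and swap_mfn at
bits 0 to 2 and the page order in bits 10-15 of the map_foreign_page flags) are
taken from the tables in the previous draft and are assumed to be unchanged.
The `EMU_*` names, the enum values and the 32-bit flags width are hypothetical
and only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the previous draft's query_caps table. */
#define IOMMU_QUERY_map_cap   (1u << 0)

/* Assumed from the previous draft's map_foreign_page flags table. */
#define EMU_FLAG_readable     (1u << 0)
#define EMU_FLAG_writeable    (1u << 1)
#define EMU_FLAG_swap_mfn     (1u << 2)
#define EMU_PAGE_ORDER_SHIFT  10
#define EMU_PAGE_ORDER_MASK   0x3fu   /* bits 10-15 hold the page order */

/* Hypothetical identifiers; the real subop numbering is not given here. */
enum emu_subop {
    EMU_SUBOP_map_foreign_page,
    EMU_SUBOP_lookup_foreign_page,
};

/* Pick the subop an emulator should use from the query_caps flags. */
enum emu_subop emu_choose_subop(uint32_t query_caps_flags)
{
    if (query_caps_flags & IOMMU_QUERY_map_cap)
        return EMU_SUBOP_map_foreign_page;    /* emulator chooses the BFN */
    return EMU_SUBOP_lookup_foreign_page;     /* Xen supplies the BFN */
}

/* Build the flags word for a map_foreign_page request. */
uint32_t emu_map_flags(bool readable, bool writeable, bool swap_mfn,
                       unsigned int page_order)
{
    assert(page_order <= EMU_PAGE_ORDER_MASK);
    return (readable ? EMU_FLAG_readable : 0) |
           (writeable ? EMU_FLAG_writeable : 0) |
           (swap_mfn ? EMU_FLAG_swap_mfn : 0) |
           ((uint32_t)page_order << EMU_PAGE_ORDER_SHIFT);
}
```
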
Operating systems which use the IOMMUOP_map_page subop are expected to provide a
common interface for emulators to use. Otherwise emulators will not be aware
of existing BFN mappings created by the operating system and will get failed
subops due to conflicts in the BFN address space for the domain.
Emulators should unmap unused GFN mappings as often as possible using
IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
quickly and efficiently.
Emulators should conform to the ballooning behaviour described in section
"IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
domains are able to effectively balloon out and in memory.
Emulators must unmap any active BFN mappings when they shut down.
IOMMUOP_*_foreign_page interactions with guest domain ballooning
================================================================
Guest domains can balloon out a set of GFN mappings at any time and render the
BFN to GFN mapping invalid.
When a BFN to GFN mapping becomes invalid, Xen will issue a buffered I/O request
of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
BFN address in the data field. If the buffered I/O request ring is full then a
standard (synchronous) I/O request of type IOREQ_TYPE_INVALIDATE will be issued
to the affected IOREQ server with the just invalidated BFN address in the data
field.
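
The buffered-then-synchronous fallback might look like the sketch below. The
ring layout, its size and all names are hypothetical; they do not reflect the
actual Xen buffered ioreq structures and only model the "try the ring first,
fall back when full" behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 4  /* hypothetical ring size */

/* Toy single-producer ring of invalidated BFNs. */
struct buf_ring {
    uint64_t bfn[RING_SLOTS];
    unsigned int prod;  /* advanced by the notifier */
    unsigned int cons;  /* advanced by the emulator */
};

enum notify_path { NOTIFY_BUFFERED, NOTIFY_SYNCHRONOUS };

/* Queue an invalidate for bfn; report which path was taken. */
enum notify_path notify_invalidate(struct buf_ring *r, uint64_t bfn)
{
    assert(r->prod - r->cons <= RING_SLOTS);
    if (r->prod - r->cons < RING_SLOTS) {
        r->bfn[r->prod % RING_SLOTS] = bfn;
        r->prod++;
        return NOTIFY_BUFFERED;
    }
    /* Ring full: the same BFN would be sent as a synchronous request. */
    return NOTIFY_SYNCHRONOUS;
}
```
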
The BFN mappings cannot be simply unmapped at the point of the balloon hypercall,
otherwise a malicious guest could deliberately balloon out a GFN address that is
in use by an emulator and trigger IOMMU faults for the domains with BFN
mappings.
For hosts with no IOMMU support: The affected emulator(s) must specifically
issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
the references to the underlying MFN are removed and the MFN can be freed back
to the Xen memory allocator.
For hosts with IOMMU support:
If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop then the affected emulator(s) must
specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN
address so that the references to the underlying MFN are removed.
If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
the BFN mapping will be swapped to point at a scratch MFN page and all BFN
references to the invalid MFN will be removed by Xen after the BFN mapping has
been updated to point at the scratch MFN page.
The rationale for swapping the BFN mapping to point at scratch pages is to
enable guest domains to balloon quickly without requiring hypercall(s) from
emulators.
Not all BFN mappings can be swapped without potentially causing problems for the
hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
allow per BFN control of Xen ballooning behaviour.
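The ballooning cases above can be summarised as a decision function. This is one reading of the text, not a definitive statement of Xen's behaviour: the names are hypothetical, and the sketch assumes that a single mapping without IOMMUOP_swap_mfn is enough to require explicit unmaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum balloon_action {
    EMULATOR_MUST_UNMAP,   /* emulator issues IOMMUOP_unmap_foreign_page */
    XEN_SWAPS_TO_SCRATCH   /* Xen repoints the BFN at a scratch MFN */
};

/*
 * swap_flags[i] is true if emulator i mapped the ballooned GFN with
 * IOMMUOP_swap_mfn set. n is the number of emulators with a mapping.
 */
enum balloon_action on_gfn_ballooned(bool host_has_iommu,
                                     const bool *swap_flags, size_t n)
{
    assert(swap_flags != NULL || n == 0);
    if (!host_has_iommu)
        return EMULATOR_MUST_UNMAP;
    for (size_t i = 0; i < n; i++)
        if (!swap_flags[i])
            return EMULATOR_MUST_UNMAP;
    return XEN_SWAPS_TO_SCRATCH;
}
```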
PV IOMMU interactions with self ballooning
==========================================
A guest should clear any IOMMU mappings it has of its own pages before
releasing a page back to Xen. The guest also will need to add IOMMU mappings
after repopulating a page with the populate_physmap hypercall.
PV guests must clear any IOMMU mappings before pinning page table pages
because the IOMMU mappings will take a writable reference count and this will
prevent page table pinning.
Security Implications of allowing domain IOMMU control
======================================================
Xen currently allows I/O devices attached to the hardware domain to have direct
access to all of the MFN address space (except Xen hypervisor memory regions),
provided the Xen IOMMU option dom0-strict is not enabled.
The PV IOMMU feature provides the same level of access to MFN address space
and the feature is not enabled when the Xen IOMMU option dom0-strict is
enabled. Therefore security is not degraded by the PV IOMMU feature.
Domains with physical device(s) assigned which are not hardware domains are only
allowed to map their own GFNs or GFNs for domain(s) they have privilege over.
PV IOMMU interactions with grant map/unmap operations
=====================================================
Grant map operations return a Physical device accessible address (BFN) if the
GNTMAP_device_map flag is set.  This operation currently returns the MFN for PV
guests which may conflict with the BFN address space the guest uses if PV IOMMU
map support is available to the guest.
This design proposes to allow the calling domain to control the BFN address that
a grant map operation uses.
This can be achieved by specifying that the dev_bus_addr in the
gnttab_map_grant_ref structure is used as an input parameter instead of the
output parameter it is currently.
Only PAGE_SIZE aligned addresses are allowed for the dev_bus_addr input parameter.
The revised structure is shown below for convenience.
    struct gnttab_map_grant_ref {
        /* IN parameters. */
        uint64_t host_addr;
        uint32_t flags;               /* GNTMAP_* */
        grant_ref_t ref;
        domid_t  dom;
        /* OUT parameters. */
        int16_t  status;              /* => enum grant_status */
        grant_handle_t handle;
        /* IN/OUT parameters */
        uint64_t dev_bus_addr;
    };
The grant map operation would then behave similarly to the IOMMUOP_map_page
subop for the creation of the IOMMU mapping.
The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
subop for the removal of the IOMMU mapping.
A new grantmap flag would be used to indicate the domain is requesting that the
dev_bus_addr field be used as an input parameter.
    #define _GNTMAP_request_bfn_map      (6)
    #define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)
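As an illustration of how a caller might fill in the revised structure, the sketch below sets the proposed flag and checks the alignment rule. The typedef widths and the 4 KiB PAGE_SIZE are local stand-ins assumed for the example, `prepare_grant_map_at_bfn` is a hypothetical helper, and combining the new flag with GNTMAP_device_map is an assumption rather than something this proposal states.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the Xen public header types (assumed widths). */
typedef uint32_t grant_ref_t;
typedef uint32_t grant_handle_t;
typedef uint16_t domid_t;

#define PAGE_SIZE                (1ULL << 12)   /* assumed 4 KiB pages */

#define _GNTMAP_device_map       (0)
#define GNTMAP_device_map        (1 << _GNTMAP_device_map)
#define _GNTMAP_request_bfn_map  (6)
#define GNTMAP_request_bfn_map   (1 << _GNTMAP_request_bfn_map)

struct gnttab_map_grant_ref {
    uint64_t host_addr;
    uint32_t flags;
    grant_ref_t ref;
    domid_t  dom;
    int16_t  status;
    grant_handle_t handle;
    uint64_t dev_bus_addr;        /* IN when GNTMAP_request_bfn_map is set */
};

/* Returns 0 on success, -1 if bus_addr is not PAGE_SIZE aligned. */
int prepare_grant_map_at_bfn(struct gnttab_map_grant_ref *op, domid_t dom,
                             grant_ref_t ref, uint64_t bus_addr)
{
    assert(op != 0);
    if (bus_addr & (PAGE_SIZE - 1))
        return -1;
    op->host_addr = 0;
    op->flags = GNTMAP_device_map | GNTMAP_request_bfn_map;
    op->ref = ref;
    op->dom = dom;
    op->status = 0;
    op->handle = 0;
    op->dev_bus_addr = bus_addr;  /* requested bus address for the mapping */
    return 0;
}
```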
Xen PV-IOMMU Architecture
=========================
The Xen architecture consists of a new hypercall interface and changes to the
grant map interface.
The existing IOMMU mappings set up at domain creation time will be preserved so
that PV domains unaware of this feature will continue to function with no
changes required.
Memory ballooning will be supported by taking an additional reference on the
MFN backing the GFN for each successful IOMMU mapping created.
An M2B tracking structure will be used to ensure all references to an MFN can
be located efficiently.
Xen PV IOMMU hypercall interface
--------------------------------
A two argument hypercall interface (do_iommu_op).
    ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
First argument, guest handle pointer to array of `struct pv_iommu_op`
Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
Definition of `struct pv_iommu_op`:
    struct pv_iommu_op {
        uint16_t subop_id;
        uint16_t flags;
        int32_t status;
        union {
            struct {
                uint64_t bfn;
                uint64_t gfn;
            } map_page;
            struct {
                uint64_t bfn;
            } unmap_page;
            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } map_foreign_page;
            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } lookup_foreign_page;
            struct {
                uint64_t bfn;
                ioservid_t ioserver;
            } unmap_foreign_page;
        } u;
    };
Definition of PV IOMMU subops:
    #define IOMMUOP_query_caps            1
    #define IOMMUOP_map_page              2
    #define IOMMUOP_unmap_page            3
    #define IOMMUOP_map_foreign_page      4
    #define IOMMUOP_lookup_foreign_page   5
    #define IOMMUOP_unmap_foreign_page    6
Design considerations for hypercall op
-------------------------------------------
IOMMU map/unmap operations can be slow and can involve flushing the IOMMU TLB
to ensure the I/O device uses the updated mappings.
The op has been designed to take an array of operations and a count as
parameters. This allows for easily implemented hypercall continuations to be
used and allows for batches of IOMMU operations to be submitted before flushing
the IOMMU TLB.
The `subop_id` to be used for a particular element is encoded into the element
itself. This allows for map and unmap operations to be performed in one hypercall
and for the IOMMU TLB flushing optimisations to be still applied.
The hypercall will ensure that the required IOMMU TLB flushes are applied before
returning to guest via either hypercall completion or a hypercall continuation.
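To illustrate the batching point, the sketch below builds a two element array that unmaps one BFN and maps another, so both could be submitted in a single do_iommu_op call. The helper `build_remap_batch` is hypothetical and `ioservid_t` is assumed to be a 16-bit type; the hypercall itself is not issued here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ioservid_t;   /* assumed width */

struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t status;
    union {
        struct { uint64_t bfn; uint64_t gfn; } map_page;
        struct { uint64_t bfn; } unmap_page;
        struct { uint64_t bfn; uint64_t gfn; uint16_t domid;
                 ioservid_t ioserver; } map_foreign_page;
        struct { uint64_t bfn; uint64_t gfn; uint16_t domid;
                 ioservid_t ioserver; } lookup_foreign_page;
        struct { uint64_t bfn; ioservid_t ioserver; } unmap_foreign_page;
    } u;
};

#define IOMMUOP_map_page     2
#define IOMMUOP_unmap_page   3

/*
 * Fill ops[0] with an unmap of old_bfn and ops[1] with a map of
 * new_bfn -> gfn. Returns the count to pass as the second argument.
 */
unsigned int build_remap_batch(struct pv_iommu_op *ops, uint64_t old_bfn,
                               uint64_t new_bfn, uint64_t gfn,
                               uint16_t map_flags)
{
    assert(ops != NULL);
    memset(ops, 0, 2 * sizeof(ops[0]));
    ops[0].subop_id = IOMMUOP_unmap_page;
    ops[0].u.unmap_page.bfn = old_bfn;
    ops[1].subop_id = IOMMUOP_map_page;
    ops[1].flags = map_flags;
    ops[1].u.map_page.bfn = new_bfn;
    ops[1].u.map_page.gfn = gfn;
    return 2;
}
```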
IOMMUOP_query_caps
------------------
This subop queries the runtime capabilities of the PV-IOMMU interface for the
specific calling domain. This subop uses `struct pv_iommu_op` directly.
------------------------------------------------------------------------------
Field          Purpose
-----          ---------------------------------------------------------------
`flags`        [out] This field details the IOMMUOP capabilities.
`status`       [out] Status of this op, op specific values listed below
------------------------------------------------------------------------------
Defined bits for flags field:
------------------------------------------------------------------------------
Name                        Bit                Definition
----                       ------     ----------------------------------
IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or
                                      IOMMUOP_map_foreign_page can be used for
                                      the calling domain
IOMMU_QUERY_map_all_mfns     1        IOMMUOP_map_page subop can map any MFN
                                      not used by Xen
Reserved for future use     2-9                   n/a
IOMMU_page_order           10-15      Returns maximum possible page order for
                                      all other IOMMUOP subops
------------------------------------------------------------------------------
Defined values for query_caps subop status field:
Value   Reason
------  ----------------------------------------------------------
0       subop successfully returned
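A caller could decode the returned flags field along these lines. The bit positions come from the table above; the macro and struct names (`IOMMU_QUERY_map_cap_bit`, `struct iommu_caps`, and so on) are hypothetical and chosen only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the query_caps flags table. */
#define IOMMU_QUERY_map_cap_bit        0
#define IOMMU_QUERY_map_all_mfns_bit   1
#define IOMMU_page_order_shift         10
#define IOMMU_page_order_mask          0x3fu   /* six bits: 10-15 */

struct iommu_caps {
    bool map_cap;                 /* map subops usable by the caller */
    bool map_all_mfns;            /* any non-Xen MFN may be mapped */
    unsigned int max_page_order;  /* largest order other subops accept */
};

struct iommu_caps decode_query_caps(uint16_t flags)
{
    struct iommu_caps caps;

    caps.map_cap = (flags >> IOMMU_QUERY_map_cap_bit) & 1u;
    caps.map_all_mfns = (flags >> IOMMU_QUERY_map_all_mfns_bit) & 1u;
    caps.max_page_order =
        (flags >> IOMMU_page_order_shift) & IOMMU_page_order_mask;
    return caps;
}
```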
IOMMUOP_map_page
----------------------
This subop uses `struct map_page` part of the `struct pv_iommu_op`.
If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs except for Xen owned MFNs; otherwise the hardware
domain will only be allowed to map GFNs which it owns.
If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs without taking a reference to the MFN backing the GFN
by setting the IOMMU_MAP_OP_no_ref_cnt flag.
Every successful pv_iommu_op will result in an additional page reference being
taken on the MFN backing the GFN except for the condition detailed above.
If the map_op flags indicate a writeable mapping is required then a writeable
page type reference will be taken otherwise a standard page reference will be
taken.
All the following conditions are required to be true for PV IOMMU map
subop to succeed:
1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. If the domain is the hardware_domain, the following Xen IOMMU options are
   NOT enabled: dom0-passthrough
This subop's usage of the `struct pv_iommu_op` and `struct map_page` fields
is detailed below:
------------------------------------------------------------------------------
Field          Purpose
-----          ---------------------------------------------------------------
`bfn`          [in]  Bus address frame number(BFN) to be mapped to specified gfn
                     below
`gfn`          [in]  Guest address frame number for DOMID_SELF
`flags`        [in]  Flags for signalling type of IOMMU mapping to be created,
                     Flags can be combined.
`status`       [out] Mapping status of this op, op specific values listed below
------------------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                       -----      ----------------------------------
IOMMU_OP_readable            0        Create readable IOMMU mapping
IOMMU_OP_writeable           1        Create writeable IOMMU mapping
IOMMU_MAP_OP_no_ref_cnt      2        IOMMU mapping does not take a reference to
                                      MFN backing BFN mapping
IOMMU_MAP_OP_add_m2b         3        Wildcard M2B mapping added for
                                      lookup_foreign_page to use
Reserved for future use     4-9                   n/a
IOMMU_page_order            10-15     Page order to be used for both gfn and bfn
Defined values for map_page subop status field:
Value   Reason
------  ----------------------------------------------------------------------
0       subop successfully returned
-EIO    IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM  GFN could not be mapped because the GFN belongs to Xen.
-EPERM  Domain is not the hardware domain and GFN does not belong to domain
-EPERM  Domain is the hardware domain, IOMMU dom0-strict mode is enabled and
        GFN does not belong to domain
-EACCES BFN address conflicts with RMRR regions for devices attached to
        DOMID_SELF
-ENOSPC Page order is too large for either BFN, GFN or IOMMU unit
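The three -EPERM cases in the table above reduce to a small ownership check. The sketch below is an interpretation of those rows only; `enum gfn_owner` and `map_page_permission` are hypothetical names, and the other failure codes (-EIO, -EACCES, -ENOSPC) are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Who owns the MFN backing the GFN the caller wants to map. */
enum gfn_owner { OWNER_CALLER, OWNER_OTHER_DOMAIN, OWNER_XEN };

/* Returns 0 if the mapping is permitted, -EPERM otherwise. */
int map_page_permission(bool is_hardware_domain, bool dom0_strict,
                        enum gfn_owner owner)
{
    if (owner == OWNER_XEN)
        return -EPERM;            /* Xen owned memory is never mappable */
    if (owner == OWNER_CALLER)
        return 0;                 /* a domain may always map its own GFNs */
    /* Memory of another domain: hardware domain only, and not in strict mode. */
    if (is_hardware_domain && !dom0_strict)
        return 0;
    return -EPERM;
}
```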
IOMMUOP_unmap_page
------------------
This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
This subop's usage of the `struct pv_iommu_op` and `struct unmap_page` fields
is detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
`flags`        [in] Flags for signalling page order of unmap operation
`status`       [out] Mapping status of this unmap operation, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                       -----      ----------------------------------
IOMMU_UNMAP_OP_remove_m2b    0        Wildcard M2B mapping removed for
                                      lookup_foreign_page use
Reserved for future use     1-9                   n/a
IOMMU_page_order            10-15     Page order to be used for bfn
Defined values for unmap_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EIO         IOMMU unit returned error when attempting to unmap BFN.
-ENOSPC      Page order is too large for either BFN address or IOMMU unit
------------------------------------------------------------------------
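Building the flags value for an unmap might look like the following. The bit layout is taken from the flags table above; `encode_unmap_flags` and the macro spellings are hypothetical, and rejecting orders above 63 locally is an assumption made because the field is six bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOMMU_UNMAP_OP_remove_m2b   (1u << 0)   /* bit 0, as a mask */
#define IOMMU_page_order_shift      10
#define IOMMU_page_order_max        0x3fu       /* six bits: 10-15 */

/*
 * Encode the flags for IOMMUOP_unmap_page.
 * Returns 0 on success, -1 if the order does not fit in bits 10-15.
 */
int encode_unmap_flags(unsigned int order, bool remove_m2b,
                       uint16_t *flags_out)
{
    assert(flags_out != 0);
    if (order > IOMMU_page_order_max)
        return -1;
    *flags_out = (uint16_t)((order << IOMMU_page_order_shift) |
                            (remove_m2b ? IOMMU_UNMAP_OP_remove_m2b : 0u));
    return 0;
}
```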
IOMMUOP_map_foreign_page
------------------------
This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
It is not valid to use a domid representing the calling domain.
The hypercall will only succeed if calling domain has sufficient privilege over
the specified domid.
The M2B mechanism maps an MFN to a (BFN, domid, ioserver) tuple.
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
All the following conditions are required to be true for PV IOMMU map_foreign
subop to succeed:
1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. The domain is the hardware_domain and the following Xen IOMMU options are
   NOT enabled: dom0-passthrough
This subop's usage of the `struct pv_iommu_op` and `struct map_foreign_page`
fields is detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`domid`        [in] The domain id for which the gfn field applies
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [in] Bus address frame number for gfn address
`gfn`          [in] Guest address frame number
`flags`        [in] Details the status of the BFN mapping
`status`       [out] status of this subop, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                         Bit                Definition
----                        -----      ----------------------------------
IOMMUOP_readable              0        BFN IOMMU mapping is readable
IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
                                       swapped to scratch page
Reserved for future use      3-9       Reserved flag bits should be 0
IOMMU_page_order            10-15      Page order to be used for both gfn and bfn
Defined values for map_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EIO         IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM       Calling domain does not have sufficient privilege over domid
-EPERM       GFN could not be mapped because the GFN belongs to Xen.
-EPERM       domid maps to DOMID_SELF
-EACCES      BFN address conflicts with RMRR regions for devices attached to
             DOMID_SELF
-ENODEV      Provided ioserver id is not valid
-ENXIO       Provided domid id is not valid
-ENXIO       Provided GFN address is not valid
-ENOSPC      Page order is too large for either BFN, GFN or IOMMU unit
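The M2B rule for this subop (add an entry only if no identical one exists, and take one MFN reference per new entry) can be sketched with a toy table. Everything here is hypothetical: the fixed-size array, the `refs_taken` counter standing in for real page references, and the names `m2b_table` and `m2b_add`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define M2B_MAX 8   /* illustrative capacity only */

struct m2b_entry {
    uint64_t mfn;
    uint64_t bfn;
    uint16_t domid;
    uint16_t ioserver;
};

struct m2b_table {
    struct m2b_entry e[M2B_MAX];
    size_t n;
    unsigned int refs_taken;   /* stand-in for references on backing MFNs */
};

/* Returns 1 if a new entry was added, 0 if identical, -1 if full. */
int m2b_add(struct m2b_table *t, uint64_t mfn, uint64_t bfn,
            uint16_t domid, uint16_t ioserver)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->n; i++) {
        const struct m2b_entry *e = &t->e[i];
        if (e->mfn == mfn && e->bfn == bfn &&
            e->domid == domid && e->ioserver == ioserver)
            return 0;          /* identical entry: no new reference */
    }
    if (t->n == M2B_MAX)
        return -1;
    t->e[t->n++] = (struct m2b_entry){ mfn, bfn, domid, ioserver };
    t->refs_taken++;           /* every new entry takes one reference */
    return 1;
}
```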
IOMMUOP_lookup_foreign_page
---------------------------
This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
This subop looks up a BFN mapping for an ioserver + gfn + target domid
combination.
The hypercall will only succeed if calling domain has sufficient privilege over
the specified domid.
If a 1:1 mapping exists of BFN to MFN then a M2B entry is added and a
reference is taken to the underlying MFN. If an existing mapping is present
then the BFN is returned and no additional references will be taken to the
underlying MFN.
A 1:1 mapping will exist if there is no IOMMU support or if the PV hardware
domain was booted in dom0-relaxed mode or in dom0-passthrough mode.
If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
This subop's usage of the `struct pv_iommu_op` and `struct lookup_foreign_page`
fields is detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`domid`        [in] The domain id for which the gfn field applies
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [out] Bus address frame number for gfn address
`gfn`          [in] Guest address frame number
`flags`        [out] Details the status of the BFN mapping
`status`       [out] status of this subop, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                         Bit                Definition
----                        -----      ----------------------------------
IOMMUOP_readable              0        Returned BFN IOMMU mapping is readable
IOMMUOP_writeable             1        Returned BFN IOMMU mapping is writeable
Reserved for future use      2-9       Reserved flag bits should be 0
IOMMU_page_order            10-15      Returns maximum possible page order for
                                       all other IOMMUOP subops
Defined values for lookup_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EPERM       Calling domain does not have sufficient privilege over domid
-ENOENT      There is no available BFN for provided GFN + domid combination
-ENODEV      Provided ioserver id is not valid
-ENXIO       Provided domid id is not valid
-ENXIO       Provided GFN address is not valid
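One possible reading of the lookup rules is sketched below. The order in which an existing mapping and a 1:1 mapping are checked is an assumption, as is the helper name `lookup_foreign_bfn`; privilege checks and the -ENODEV/-ENXIO cases are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * existing_bfn: BFN already recorded for this lookup, or NULL if none.
 * one_to_one:   true when BFN == MFN holds (no IOMMU, dom0-relaxed or
 *               dom0-passthrough).
 * On success *bfn_out holds the BFN and *new_ref says whether a new M2B
 * entry (and so a new MFN reference) would be created.
 */
int lookup_foreign_bfn(const uint64_t *existing_bfn, bool one_to_one,
                       uint64_t mfn, uint64_t *bfn_out, bool *new_ref)
{
    assert(bfn_out != NULL && new_ref != NULL);
    *new_ref = false;
    if (existing_bfn != NULL) {
        *bfn_out = *existing_bfn;   /* reuse: no additional reference */
        return 0;
    }
    if (one_to_one) {
        *bfn_out = mfn;             /* the MFN is the only valid bus address */
        *new_ref = true;
        return 0;
    }
    return -ENOENT;                 /* no BFN available for this GFN + domid */
}
```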
IOMMUOP_unmap_foreign_page
--------------------------
This subop uses `struct unmap_foreign_page` part of the `struct pv_iommu_op`.
It only allows BFNs acquired via IOMMUOP_map_foreign_page or
IOMMUOP_lookup_foreign_page to be unmapped. If an attempt is made to unmap a BFN
mapped via IOMMUOP_map_page then the subop will fail.
The subop will perform a B2M lookup (IO page table walk) for the calling domain
and then index the M2B using the returned MFN. This is safe because a particular
BFN mapping can only map to one MFN for a particular calling domain.
This subop's usage of the `struct pv_iommu_op` and `struct unmap_foreign_page` fields
is detailed below:
-----------------------------------------------------------------------
Field          Purpose
-----          --------------------------------------------------------
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [in] Bus address frame number for gfn address
`flags`        [in] Flags for signalling page order of unmap operation
`status`       [out] status of this subop, 0 indicates success
-----------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                        -----     ----------------------------------
Reserved for future use     0-9                   n/a
IOMMU_page_order            10-15     Page order to be used for bfn unmapping
Defined values for unmap_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-ENOENT      An M2B entry was not found for the specified input parameters.
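The two step lookup (B2M walk, then M2B index) can be illustrated with flat arrays standing in for the IO page table and the M2B. All names and the array representation are hypothetical, and the sketch only drops the M2B entry; removing the IO page table entry and the MFN reference is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct b2m_entry { uint64_t bfn; uint64_t mfn; };   /* toy IO page table */
struct m2b_entry { uint64_t mfn; uint64_t bfn; uint16_t ioserver; bool valid; };

/* Returns 0 if a matching M2B entry was found and dropped, else -ENOENT. */
int unmap_foreign_bfn(const struct b2m_entry *b2m, size_t n_b2m,
                      struct m2b_entry *m2b, size_t n_m2b,
                      uint64_t bfn, uint16_t ioserver)
{
    assert(b2m != NULL && m2b != NULL);
    for (size_t i = 0; i < n_b2m; i++) {
        if (b2m[i].bfn != bfn)
            continue;
        /* B2M walk succeeded: a BFN maps to exactly one MFN per domain. */
        uint64_t mfn = b2m[i].mfn;
        for (size_t j = 0; j < n_m2b; j++) {
            if (m2b[j].valid && m2b[j].mfn == mfn &&
                m2b[j].bfn == bfn && m2b[j].ioserver == ioserver) {
                m2b[j].valid = false;
                return 0;
            }
        }
        return -ENOENT;    /* mapped, but not via a foreign subop */
    }
    return -ENOENT;        /* BFN not mapped at all */
}
```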
Linux kernel architecture
=========================
The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
space into the IOMMU. It will map the PFNs to the IOMMU address space using
a 1:1 mapping, it does this by programming a BFN to GFN mapping which matches
the PFN to GFN mapping.
The native SWIOTLB will be used to handle devices which cannot DMA to all of
the kernel's PFN address space.
An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
subops which will allow the Linux kernel to centrally manage that domain's BFN
resource and ensure there are no unexpected conflicts.
Kernel Map Foreign GFN to BFN interface
---------------------------------------
An array of 'count' 'struct pv_iommu_op' elements will be passed to the mapping
function.
    int map_foreign_gfn_to_bfn(int count, struct pv_iommu_op *ops)
The calling function will use the `struct map_foreign_page` inside the `struct
pv_iommu_op` and will fill in the domid, gfn and ioserver fields.
The kernel function will reuse the passed in struct pv_iommu_op for the
hypercall and will set the subop_id field based on the IOMMU_QUERY_map_cap
capability.
If the IOMMU_QUERY_map_cap is set then the kernel will allocate a suitable BFN
address, set the BFN field in the op to this address and set the subop_id to
IOMMUOP_map_foreign_page. It will do this on all 'ops' and then issue the
hypercall.
If the IOMMU_QUERY_map_cap is NOT set then the kernel will set the subop_id
to IOMMUOP_lookup_foreign_page on all `ops` and then issue the hypercall.
The calling function should check the status field in each op and if the
status field is 0 then it can use the returned BFN address in each op.
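The preparation step of such a kernel function might look like the sketch below. It stops before the hypercall and uses a trivial linear BFN allocator, both simplifications for illustration. Since the ops carry `struct map_foreign_page`, the sketch assumes the foreign variants of the map and lookup subops are the ones intended; `prepare_foreign_map_ops`, the `ioservid_t` width and the mask form of IOMMU_QUERY_map_cap are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ioservid_t;   /* assumed width */

/* Only the union members this sketch touches; layout follows the draft. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t status;
    union {
        struct { uint64_t bfn; uint64_t gfn; uint16_t domid;
                 ioservid_t ioserver; } map_foreign_page;
        struct { uint64_t bfn; uint64_t gfn; uint16_t domid;
                 ioservid_t ioserver; } lookup_foreign_page;
    } u;
};

#define IOMMUOP_map_foreign_page      4
#define IOMMUOP_lookup_foreign_page   5
#define IOMMU_QUERY_map_cap           (1u << 0)   /* assumed mask for bit 0 */

/*
 * Set subop_id (and, when the kernel controls the BFN space, the bfn
 * field) on every op. first_free_bfn is a toy linear allocator.
 */
void prepare_foreign_map_ops(struct pv_iommu_op *ops, int count,
                             uint16_t caps, uint64_t first_free_bfn)
{
    assert(ops != NULL && count >= 0);
    for (int i = 0; i < count; i++) {
        if (caps & IOMMU_QUERY_map_cap) {
            ops[i].subop_id = IOMMUOP_map_foreign_page;
            ops[i].u.map_foreign_page.bfn = first_free_bfn + (uint64_t)i;
        } else {
            ops[i].subop_id = IOMMUOP_lookup_foreign_page;
        }
    }
    /* The caller would now issue do_iommu_op(ops, count). */
}
```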
Kernel Unmap Foreign GFN to BFN interface
-----------------------------------------
An array of 'count' 'struct pv_iommu_op' elements will be passed to the unmapping
function.
    int unmap_foreign_gfn_to_bfn(int count, struct pv_iommu_op *ops)
The calling function will use the `struct unmap_foreign_page` inside the `struct
pv_iommu_op` and will fill in the bfn field.
The kernel function will set the subop_id field to IOMMUOP_unmap_foreign_page
in each op and then issue the hypercall.
The calling function should check the status field in each op and if the
status field is 0 then the BFN has been successfully unmapped.
- * Re: Xen PV IOMMU interface draft D
  2016-02-10 10:09   ` Xen PV IOMMU interface draft D Malcolm Crossley
@ 2016-02-18  8:21     ` Tian, Kevin
  2016-02-23 16:17     ` Jan Beulich
  2016-03-02  6:54     ` Tian, Kevin
  2 siblings, 0 replies; 20+ messages in thread
From: Tian, Kevin @ 2016-02-18  8:21 UTC (permalink / raw)
  To: Malcolm Crossley, xen-devel, Jan Beulich, Konrad Rzeszutek Wilk,
	Andrew Cooper, Paul Durrant, Lv, Zhiyuan, Zhang, Yu C,
	David Vrabel, Ian Campbell
> From: Malcolm Crossley [mailto:malcolm.crossley@citrix.com]
> Sent: Wednesday, February 10, 2016 6:09 PM
As Konrad commented, it's better to add this doc as the 1st patch in your series;
then it's easier to review it together with the other patches. Also it's always
good to include such a design doc in the repo.
Other comments embedded.
[...]
> 
> Clarification of GFN and BFN fields for different guest types
> -------------------------------------------------------------
> 
[...]
> Bus Frame Numbers (BFN) refer to the address presented on the physical bus
> before being translated by the IOMMU.
> 
> Diagram below details memory accesses originating from physical device.
> 
>     Physical Device
>           |
>         (BFN)
>           |
>        IOMMU-PT
>           |
>         (MFN)
>           |
>          RAM
Curious what IOMMU-'PT' means here?
[...]
> General principles for PV IOMMU interface
> =========================================
> 
> There are two different usage models for the BFN address space of a calling
> guest based upon the two purposes specified in the section above.
> 
> A calling guest may use their BFN address space for only one of the purposes
> detailed above and so the PV IOMMU interface has a subop per usage model.
> Furthermore, the IOMMU mapping of foreign domains memory is more complex than
> IOMMU mapping local domain memory and seperating the subops allows for the
> complexity to be split in the implementation.
> 
> The PV IOMMU design allows the calling domain to control it's BFN memory map.
> Thus the design also assigns the responsiblity of ensuring a BFN address
> mapped for local domain memory mappings are not reused for foreign domain
> memory mappings without an explict unmap of BFN address first. This simplifies
> the usage of the API and the extra overhead for the calling domains should be
> minimal as they should be already tracking the BFN address space usage already.
It might be clearer if you can add a separate section for BFN itself, i.e.
how it is managed/allocated in different scenarios. I know most info is
already provided in this text, but not centralized so far. :-)
> 
> 
> Emulator usage of PV IOMMU interface
> ====================================
I'd suggest moving this and later sections to behind basic API introduction.
Otherwise insufficient background on so many API references at this point.
> 
> Emulators which require bus address mapping of guest RAM must first determine if
> it's possible for the domain to control the bus addresses themselves.
> 
> A IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
> flag is set then the emulator may specify the BFN address it wishes guest RAM to
> be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
> then the emulator must use BFN addresses supplied by the Xen via the
> IOMMUOP_lookup_foreign_page.
IOMMU_QUERY_map_cap is a bit confusing here. The above paragraph is about
whether the emulator is allowed to allocate/specify BFN itself. However this
capability name reads more as whether the calling domain can map foreign
pages, which is actually true regardless of how BFN is allocated.
> 
> Operating systems which use the IOMMUOP_map_page subop are expected to provide a
> common interface for emulators to use. Otherwise emulators will not be aware
> of existing BFN mappings created by operating system and will get failed
> subops due to conflicts in the BFN address space for the domain.
Do you mean that emulator needs to detect whether OS is using
IOMMUOP_map_page? If yes, then emulator calls a common interface
provided by OS. If not, then emulator just directly invokes raw IOMMUOP
itself. I'm not certain whether there is common mechanism to detect
this so far. Could you elaborate your thought here?
> 
> Emulators should unmap unused GFN mappings as often as possible using
> IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
> quickly and efficiently.
Following the earlier analysis, this only applies when OS doesn't use IOMMUOP.
Otherwise emulator needs to call an 'OS common interface', right?
> 
> Emulators should conform to the ballooning behaviour described section
> "IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
> domains are able to effectively balloon out and in memory.
> 
> Emulators must unmap any active BFN mappings when they shutdown.
> 
> IOMMUOP_*_foreign_page interactions with guest domain ballooning
> ================================================================
> 
> Guest domains can balloon out a set of GFN mappings at any time and render the
> BFN to GFN mapping invalid.
> 
> When a BFN to GFN mapping becomes invalid, Xen will issue a buffered I/O request
> of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
> BFN address in the data field. If the buffered I/O request ring is full then a
> standard (synchronous) I/O request of type IOREQ_TYPE_INVALIDATE will be issued
> to the affected IOREQ server the with just invalidated BFN address in the data
> field.
> 
> The BFN mappings cannot be simply unmapped at the point of the balloon hypercall
> otherwise a malicious guest could specifically balloon out an in use GFN address
> in use by an emulator and trigger IOMMU faults for the domains with BFN
> mappings.
Is it a real problem? Today for PCI passthru, what will happen if guest programs
assigned device with a bad GPA which is not mapped in IOMMU? I think IOMMU
fault should be fine, and we can just leverage existing IOMMU fault handling after
the fault is triggered.
> 
> For hosts with no IOMMU support: The affected emulator(s) must specifically
> issue a IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
> the references to the underlying MFN are removed and the MFN can be freed back
> to the Xen memory allocator.
> 
> For hosts with IOMMU support:
> If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page then the affected affected emulator(s) must
> specifically issue a IOMMUOP_unmap_foreign_page subop for the now invalid BFN
> address so that the references to the underlying MFN are removed.
> 
> If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
> the BFN mapping will be swapped to point at a scratch MFN page and all BFN
> references to the invalid MFN will be removed by Xen after the BFN mapping has
> been updated to point at the scratch MFN page.
I don't understand why for the 'swap' case you don't need the emulator to do an
explicit unmap. You can think of 'noswap' (page-A to invalid) as a special
example of 'swap' (page-A to scratch page), since they both move
away from the page-A reference. If there is a reason that the emulator needs
to do some cleanup internally before dropping the reference, does
'swap_mfn' break that situation then?
> 
> The rationale for swapping the BFN mapping to point at scratch pages is to
> enable guest domains to balloon quickly without requiring hypercall(s) from
> emulators.
> 
> Not all BFN mappings can be swapped without potentially causing problems for the
> hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
> allow per BFN control of Xen ballooning behaviour.
Who will judge whether a BFN mapping can be swapped then?
[...]
> Xen PV IOMMU hypercall interface
> --------------------------------
> A two argument hypercall interface (do_iommu_op).
> 
>     ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
> 
> First argument, guest handle pointer to array of `struct pv_iommu_op`
> 
> Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
> 
> Definition of `struct pv_iommu_op`:
> 
>     struct pv_iommu_op {
> 
>         uint16_t subop_id;
>         uint16_t flags;
>         int32_t status;
> 
>         union {
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>             } map_page;
> 
>             struct {
>                 uint64_t bfn;
>             } unmap_page;
> 
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>                 uint16_t domid;
>                 ioservid_t ioserver;
>             } map_foreign_page;
> 
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>                 uint16_t domid;
>                 ioservid_t ioserver;
>             } lookup_foreign_page;
> 
>             struct {
>                 uint64_t bfn;
>                 ioservid_t ioserver;
>             } unmap_foreign_page;
>         } u;
>     };
Do we really need the ioserver ID here? Could it be as simple
as looping over all ioreq servers with INVALIDATE notifications?
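To make the role of the ioserver field concrete, here is a minimal sketch of how a caller might fill one `struct pv_iommu_op` element for a foreign mapping. It assumes `ioservid_t` is a 16-bit ID (as in Xen's public HVM headers); the subop number and the helper `fill_map_foreign()` are hypothetical, since the draft excerpt gives neither numeric subop values nor a helper API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: ioservid_t is a 16-bit ID, as in Xen's public HVM headers. */
typedef uint16_t ioservid_t;

/* Hypothetical value: the draft excerpt does not list subop numbers. */
#define IOMMUOP_map_foreign_page 4

struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t status;

    union {
        struct { uint64_t bfn; uint64_t gfn; } map_page;
        struct { uint64_t bfn; } unmap_page;
        struct {
            uint64_t bfn;
            uint64_t gfn;
            uint16_t domid;
            ioservid_t ioserver;
        } map_foreign_page;
        struct {
            uint64_t bfn;
            uint64_t gfn;
            uint16_t domid;
            ioservid_t ioserver;
        } lookup_foreign_page;
        struct { uint64_t bfn; ioservid_t ioserver; } unmap_foreign_page;
    } u;
};

/* Hypothetical helper: prepare one array element for do_iommu_op(). */
static void fill_map_foreign(struct pv_iommu_op *op, uint16_t flags,
                             uint64_t bfn, uint64_t gfn,
                             uint16_t domid, ioservid_t ioserver)
{
    assert(op != NULL);

    memset(op, 0, sizeof(*op));   /* status starts at 0; Xen writes it back */
    op->subop_id = IOMMUOP_map_foreign_page;
    op->flags = flags;
    op->u.map_foreign_page.bfn = bfn;
    op->u.map_foreign_page.gfn = gfn;
    op->u.map_foreign_page.domid = domid;
    op->u.map_foreign_page.ioserver = ioserver;
}
```

An emulator would build an array of such elements and pass it, together with the element count, as the two arguments of the hypercall.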
[...]
> 
> IOMMUOP_map_page
> ----------------------
> This subop uses `struct map_page` part of the `struct pv_iommu_op`.
> 
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFNs except for Xen owned MFNs else the hardware
> domain will only be allowed to map GFNs which it owns.
"map all GFNs" -> "map all MFNs" since you use "except for Xen owned MFNs"
later. Since you have a capability called IOMMU_QUERY_map_all_mfns, should
you add that condition to the description above?
> 
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFNs without taking a reference to the MFN backing the GFN
> by setting the IOMMU_MAP_OP_no_ref_cnt flag.
Could you elaborate on when no_ref_cnt is required?
[...]
> 
> IOMMUOP_unmap_page
> ------------------
> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
> 
> The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
> are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
> 
> `flags`        [in] Flags for signalling page order of unmap operation
> 
> `status`       [out] Mapping status of this unmap operation, 0 indicates success
> --------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                        Bit                Definition
> ----                       -----      ----------------------------------
> IOMMU_UNMAP_OP_remove_m2b    0        Wildcard M2B mapping removed for
>                                       lookup_foreign_page use
Is it explicitly required? Should it be implicit as long as a valid M2B entry exists?
[...]
> IOMMUOP_map_foreign_page
> ------------------------
> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
> 
> It is not valid to use a domid representing the calling domain.
Then what's being used here to represent the calling domain?
> 
> The hypercall will only succeed if calling domain has sufficient privilege over
> the specified domid.
How is this privilege check being done? Is there an existing mechanism, or
something new to add?
> 
> The M2B mechanism is an MFN to (BFN,domid,ioserver) tuple.
> 
> Each successful subop will add to the M2B if there was not an existing identical
> M2B entry.
> 
> Every new M2B entry will take a reference to the MFN backing the GFN.
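The three M2B rules quoted above (tuple contents, no duplicate entries, one reference per new entry) can be sketched as follows. This is an illustration only: the fixed-size table, the type and function names, and the single `mfn_refs` counter standing in for per-MFN reference counting are all assumptions, not the RFC implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define M2B_MAX_ENTRIES 16   /* arbitrary bound for the sketch */

/* One M2B entry: an MFN mapped to a (BFN, domid, ioserver) tuple. */
struct m2b_entry {
    uint64_t mfn;
    uint64_t bfn;
    uint16_t domid;
    uint16_t ioserver;
};

struct m2b_table {
    struct m2b_entry entries[M2B_MAX_ENTRIES];
    size_t count;
    unsigned int mfn_refs;   /* stand-in for per-MFN reference counting */
};

/*
 * Returns 1 if a new entry was added (and one reference taken),
 * 0 if an identical entry already existed (no extra reference),
 * -1 if the table is full.
 */
static int m2b_add(struct m2b_table *t, uint64_t mfn, uint64_t bfn,
                   uint16_t domid, uint16_t ioserver)
{
    size_t i;

    assert(t != NULL);

    for (i = 0; i < t->count; i++) {
        const struct m2b_entry *e = &t->entries[i];

        if (e->mfn == mfn && e->bfn == bfn &&
            e->domid == domid && e->ioserver == ioserver)
            return 0;
    }

    if (t->count == M2B_MAX_ENTRIES)
        return -1;

    t->entries[t->count].mfn = mfn;
    t->entries[t->count].bfn = bfn;
    t->entries[t->count].domid = domid;
    t->entries[t->count].ioserver = ioserver;
    t->count++;
    t->mfn_refs++;   /* every new M2B entry takes a reference to the MFN */
    return 1;
}
```

Note that two entries for the same MFN with different BFNs count as distinct entries here, so each takes its own reference.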
> 
> All the following conditions are required to be true for PV IOMMU map_foreign
> subop to succeed:
> 
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. The domain is the hardware_domain and the following Xen IOMMU options are
>    NOT enabled: dom0-passthrough
4. the domain has sufficient privilege over the specified domid;
[...]
> 
> IOMMU_lookup_foreign_page
> -------------------------
> This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
> 
> This subop looks up a BFN mapping for an ioserver + gfn + target domid
> combination.
> 
> The hypercall will only succeed if calling domain has sufficient privilege over
> the specified domid.
> 
> If a 1:1 mapping exists of BFN to MFN then a M2B entry is added and a
> reference is taken to the underlying MFN. If an existing mapping is present
Then when will this very reference be dropped?
> then the BFN is returned and no additional references will be taken to the
> underlying MFN.
> 
> A 1:1 mapping will exist if there is no IOMMU support or if the PV hardware
> domain was booted in dom0-relaxed mode or in dom0-passthrough mode.
What about the hardware domain using IOMMUOPs in the meantime? In that
case, from your earlier description it is the hardware domain that manages the BFN
address space, while here the 1:1 mapping is a hard assumption in the hypervisor,
so the two may conflict. There needs to be a mechanism so that
once Xen sees any explicit BFN passed from the hardware domain, the
1:1 mapping scheme is disabled.
> 
> If there is no IOMMU support then the MFN is returned in the BFN field (that is
> the only valid bus address for the GFN + domid combination).
> 
[...]
> 
> Linux kernel architecture
> =========================
> 
> The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
> space into the IOMMU. It will map the PFNs to the IOMMU address space using
> a 1:1 mapping, it does this by programming a BFN to GFN mapping which matches
> the PFN to GFN mapping.
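As an illustration of the 1:1 scheme described above, the sketch below builds a batch of map_page arguments in which each BFN equals the kernel PFN and the GFN is whatever that PFN translates to. The struct, the function names and the callback type are hypothetical, and `example_pfn_to_gfn()` is a made-up translation used only so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The (bfn, gfn) pair carried by the map_page part of struct pv_iommu_op. */
struct map_page_args {
    uint64_t bfn;
    uint64_t gfn;
};

/* Hypothetical callback: translate a kernel PFN to the GFN backing it. */
typedef uint64_t (*pfn_to_gfn_fn)(uint64_t pfn);

/* Made-up translation for the example: a fixed offset. */
static uint64_t example_pfn_to_gfn(uint64_t pfn)
{
    return pfn + 0x1000;
}

/*
 * Fill up to max entries for PFNs first_pfn .. first_pfn + count - 1.
 * Returns the number of entries written.
 */
static size_t build_one_to_one_batch(struct map_page_args *out, size_t max,
                                     uint64_t first_pfn, size_t count,
                                     pfn_to_gfn_fn pfn_to_gfn)
{
    size_t i, n = count < max ? count : max;

    assert(out != NULL && pfn_to_gfn != NULL);

    for (i = 0; i < n; i++) {
        uint64_t pfn = first_pfn + i;

        out[i].bfn = pfn;               /* BFN equals PFN: the 1:1 mapping */
        out[i].gfn = pfn_to_gfn(pfn);   /* matches the PFN to GFN mapping */
    }

    return n;
}
```

With this layout a driver can hand a PFN-derived address to a device directly, which is why only devices with limited DMA reach still need the SWIOTLB.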
> 
> The native SWIOTLB will be used to handle devices which cannot DMA to all of
> the kernel's PFN address space.
> 
> An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
> subops which will allow the Linux kernel to centrally manage that domain's BFN
> resource and ensure there are no unexpected conflicts.
One open question here. When the IOMMU is enabled, there is supposed to be an
IOVA space created in the Linux kernel. How does this BFN space interact
with that one?
Thanks
Kevin
- * Re: Xen PV IOMMU interface draft D
  2016-02-10 10:09   ` Xen PV IOMMU interface draft D Malcolm Crossley
  2016-02-18  8:21     ` Tian, Kevin
@ 2016-02-23 16:17     ` Jan Beulich
  2016-02-23 16:22       ` Malcolm Crossley
  2016-03-02  6:54     ` Tian, Kevin
  2 siblings, 1 reply; 20+ messages in thread
From: Jan Beulich @ 2016-02-23 16:17 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: Kevin Tian, Ian Campbell, Yu C Zhang, AndrewCooper, Paul Durrant,
	David Vrabel, xen-devel, Zhiyuan Lv
>>> On 10.02.16 at 11:09, <malcolm.crossley@citrix.com> wrote:
> % Xen PV IOMMU interface
> % Malcolm Crossley <<malcolm.crossley@citrix.com>>
>   Paul Durrant <<paul.durrant@citrix.com>>
> % Draft D
> 
> Introduction
> ============
> 
> Revision History
> ----------------
> 
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft A  10 Apr 2014  Initial draft.
> 
> Draft B  12 Jun 2015  Second draft.
> 
> Draft C  26 Jun 2015  Third draft.
> 
> Draft D  09 Feb 2016  Fourth draft.
Unless this is a complete re-write, I'd really like to avoid needing to
read through all of it again. Do you perhaps have a PDF version
somewhere with change marks?
Jan
- * Re: Xen PV IOMMU interface draft D
  2016-02-23 16:17     ` Jan Beulich
@ 2016-02-23 16:22       ` Malcolm Crossley
  0 siblings, 0 replies; 20+ messages in thread
From: Malcolm Crossley @ 2016-02-23 16:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Ian Campbell, Yu C Zhang, AndrewCooper, Paul Durrant,
	David Vrabel, xen-devel, Zhiyuan Lv
On 23/02/16 16:17, Jan Beulich wrote:
>>>> On 10.02.16 at 11:09, <malcolm.crossley@citrix.com> wrote:
>> % Xen PV IOMMU interface
>> % Malcolm Crossley <<malcolm.crossley@citrix.com>>
>>   Paul Durrant <<paul.durrant@citrix.com>>
>> % Draft D
>>
>> Introduction
>> ============
>>
>> Revision History
>> ----------------
>>
>> --------------------------------------------------------------------
>> Version  Date         Changes
>> -------  -----------  ----------------------------------------------
>> Draft A  10 Apr 2014  Initial draft.
>>
>> Draft B  12 Jun 2015  Second draft.
>>
>> Draft C  26 Jun 2015  Third draft.
>>
>> Draft D  09 Feb 2016  Fourth draft.
> 
> Unless this is a complete re-write, I'd really like to avoid needing to
> read through all of it again. Do you perhaps have a PDF version
> somewhere with change marks?
> 
It was really a minor update to ensure the design matched the RFC implementation.
The diff for draft D is below. For the next revision I will detail what has changed.
diff -r a35c08555541 -r e829e3d0fcec designs/pv-iommu-control/design.txt
--- a/designs/pv-iommu-control/design.txt
+++ b/designs/pv-iommu-control/design.txt
@@ -1,7 +1,7 @@
 % Xen PV IOMMU interface
 % Malcolm Crossley <<malcolm.crossley@citrix.com>>
   Paul Durrant <<paul.durrant@citrix.com>>
-% Draft C
+% Draft D
 Introduction
 ============
@@ -17,6 +17,8 @@ Draft A  10 Apr 2014  Initial draft.
 Draft B  12 Jun 2015  Second draft.
 Draft C  26 Jun 2015  Third draft.
+
+Draft D  09 Feb 2016  Fourth draft.
 --------------------------------------------------------------------
 Background
@@ -481,7 +483,9 @@ IOMMU_OP_readable            0        Cr
 IOMMU_OP_writeable           1        Create writeable IOMMU mapping
 IOMMU_MAP_OP_no_ref_cnt      2        IOMMU mapping does not take a reference to
                                       MFN backing BFN mapping
-Reserved for future use     3-9                   n/a
+IOMMU_MAP_OP_add_m2b         3        Wildcard M2B mapping added for
+                                      lookup_foreign_page to use
+Reserved for future use     4-9                   n/a
 IOMMU_page_order            10-15     Page order to be used for both gfn and bfn
 Defined values for map_page subop status field:
@@ -519,7 +523,9 @@ Defined bits for flags field:
 Name                        Bit                Definition
 ----                       -----      ----------------------------------
-Reserved for future use     0-9                   n/a
+IOMMU_UNMAP_OP_remove_m2b    0        Wildcard M2B mapping removed for
+                                      lookup_foreign_page use
+Reserved for future use     1-9                   n/a
 IOMMU_page_order            10-15     Page order to be used for bfn
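The flag layout after this change can be sketched in C as below. The bit positions are taken from the tables in the diff; the `_shift` and `_mask` macros and the two helper functions are hypothetical names added for illustration and are not part of the RFC patches.

```c
#include <assert.h>
#include <stdint.h>

/* map_page flags: bit positions as listed in the draft D table. */
#define IOMMU_OP_readable          (1u << 0)
#define IOMMU_OP_writeable         (1u << 1)
#define IOMMU_MAP_OP_no_ref_cnt    (1u << 2)
#define IOMMU_MAP_OP_add_m2b       (1u << 3)   /* new in draft D */

/* unmap_page flags. */
#define IOMMU_UNMAP_OP_remove_m2b  (1u << 0)   /* new in draft D */

/* IOMMU_page_order occupies bits 10-15 (hypothetical helper macros). */
#define IOMMU_page_order_shift     10
#define IOMMU_page_order_mask      0x3fu

/* Combine map_page flag bits (bits 0-3 only) with a page order. */
static uint16_t iommu_map_flags(uint16_t bits, unsigned int order)
{
    assert(order <= IOMMU_page_order_mask);
    assert((bits & ~0xfu) == 0);   /* bits 4-9 are reserved */

    return (uint16_t)(bits | (order << IOMMU_page_order_shift));
}

/* Extract the page order from a flags value. */
static unsigned int iommu_flags_order(uint16_t flags)
{
    return (flags >> IOMMU_page_order_shift) & IOMMU_page_order_mask;
}
```

For example, a readable and writeable order-9 mapping that also adds a wildcard M2B entry would be `iommu_map_flags(IOMMU_OP_readable | IOMMU_OP_writeable | IOMMU_MAP_OP_add_m2b, 9)`.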
Sorry about the hassle. I posted the RFC because I would like to get some comments on the M2B
implementation and the iommu_lookup_page code.
Malcolm
> Jan
> 
- * Re: Xen PV IOMMU interface draft D
  2016-02-10 10:09   ` Xen PV IOMMU interface draft D Malcolm Crossley
  2016-02-18  8:21     ` Tian, Kevin
  2016-02-23 16:17     ` Jan Beulich
@ 2016-03-02  6:54     ` Tian, Kevin
  2 siblings, 0 replies; 20+ messages in thread
From: Tian, Kevin @ 2016-03-02  6:54 UTC (permalink / raw)
  To: Malcolm Crossley, xen-devel, Jan Beulich, Konrad Rzeszutek Wilk,
	Andrew Cooper, Paul Durrant, Lv, Zhiyuan, Zhang, Yu C,
	David Vrabel, Ian Campbell
Hi, Malcolm,
Not sure whether I missed your reply, but I failed to find it in my
archive. Could you re-post it if you already replied? Sorry that my
comments may be a bit late and did not catch the previous draft discussions,
but some of the questions below are really important to help us understand
how this new interface works with XenGT...
Thanks
Kevin
> From: Tian, Kevin
> Sent: Thursday, February 18, 2016 4:21 PM
> 
> > From: Malcolm Crossley [mailto:malcolm.crossley@citrix.com]
> > Sent: Wednesday, February 10, 2016 6:09 PM
> 
> As Konrad commented, it's better to add this doc as 1st patch in your series
> then it's easier to review it with other patches together. Also it's always
> good to include such design doc in the repo.
> 
> Other comments embedded.
> 
> [...]
> >
> > Clarification of GFN and BFN fields for different guest types
> > -------------------------------------------------------------
> >
> [...]
> > Bus Frame Numbers (BFN) refer to the address presented on the physical bus
> > before being translated by the IOMMU.
> >
> > Diagram below details memory accesses originating from physical device.
> >
> >     Physical Device
> >           |
> >         (BFN)
> >           |
> >        IOMMU-PT
> >           |
> >         (MFN)
> >           |
> >          RAM
> 
> Curious what IOMMU-'PT' means here?
> 
> [...]
> > General principles for PV IOMMU interface
> > =========================================
> >
> > There are two different usage models for the BFN address space of a calling
> > guest based upon the two purposes specified in the section above.
> >
> > A calling guest may use their BFN address space for only one of the purposes
> > detailed above and so the PV IOMMU interface has a subop per usage model.
> > Furthermore, the IOMMU mapping of foreign domains' memory is more complex than
> > IOMMU mapping local domain memory and separating the subops allows for the
> > complexity to be split in the implementation.
> >
> > The PV IOMMU design allows the calling domain to control its BFN memory map.
> > Thus the design also makes the calling domain responsible for ensuring that a BFN
> > address mapped for local domain memory mappings is not reused for foreign domain
> > memory mappings without an explicit unmap of the BFN address first. This simplifies
> > the usage of the API and the extra overhead for the calling domains should be
> > minimal as they should already be tracking the BFN address space usage.
> 
> It might be clearer if you can add a separate section for BFN itself, i.e.
> how it is managed/allocated in different scenarios. I know most info is
> already provided in this text, but not centralized so far. :-)
> 
> >
> >
> > Emulator usage of PV IOMMU interface
> > ====================================
> 
> I'd suggest moving this and later sections to behind basic API introduction.
> Otherwise insufficient background on so many API references at this point.
> 
> >
> > Emulators which require bus address mapping of guest RAM must first determine if
> > it's possible for the domain to control the bus addresses themselves.
> >
> > An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
> > flag is set then the emulator may specify the BFN address it wishes guest RAM to
> > be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
> > then the emulator must use BFN addresses supplied by Xen via the
> > IOMMUOP_lookup_foreign_page subop.
> 
> IOMMU_QUERY_map_cap is a bit confusing here. Above paragraph is about
> whether emulator is allowed to allocate/specify BFN itself. However this
> capability name is more read as whether the calling domain can map foreign
> pages which is actually true regardless of how BFN is allocated.
> 
> >
> > Operating systems which use the IOMMUOP_map_page subop are expected to provide a
> > common interface for emulators to use. Otherwise emulators will not be aware
> > of existing BFN mappings created by the operating system and will get failed
> > subops due to conflicts in the BFN address space for the domain.
> 
> Do you mean that emulator needs to detect whether OS is using
> IOMMUOP_map_page? If yes, then emulator calls a common interface
> provided by OS. If not, then emulator just directly invoke raw IOMMUOP
> itself. I'm not certain whether there is common mechanism to detect
> this so far. Could you elaborate your thought here?
> 
> >
> > Emulators should unmap unused GFN mappings as often as possible using
> > IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
> > quickly and efficiently.
> 
> Following earlier analysis then this only applies when OS doesn't use IOMMUOP.
> Otherwise emulator needs call a 'OS common interface' right?
> 
> >
> > Emulators should conform to the ballooning behaviour described in section
> > "IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
> > domains are able to effectively balloon out and in memory.
> >
> > Emulators must unmap any active BFN mappings when they shutdown.
> >
> > IOMMUOP_*_foreign_page interactions with guest domain ballooning
> > ================================================================
> >
> > Guest domains can balloon out a set of GFN mappings at any time and render the
> > BFN to GFN mapping invalid.
> >
> > When a BFN to GFN mapping becomes invalid, Xen will issue a buffered I/O request
> > of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
> > BFN address in the data field. If the buffered I/O request ring is full then a
> > standard (synchronous) I/O request of type IOREQ_TYPE_INVALIDATE will be issued
> > to the affected IOREQ server with the just invalidated BFN address in the data
> > field.
> >
> > The BFN mappings cannot simply be unmapped at the point of the balloon hypercall,
> > otherwise a malicious guest could deliberately balloon out a GFN address that is
> > in use by an emulator and trigger IOMMU faults for the domains with BFN
> > mappings.
> 
> Is it a real problem? Today for PCI passthru, what will happen if guest programs
> assigned device with a bad GPA which is not mapped in IOMMU? I think IOMMU
> fault should be fine, and we can just leverage existing IOMMU fault handling after
> the fault is triggered.
> 
> >
> > For hosts with no IOMMU support: The affected emulator(s) must specifically
> > issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
> > the references to the underlying MFN are removed and the MFN can be freed back
> > to the Xen memory allocator.
> >
> > For hosts with IOMMU support:
> > If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
> > IOMMUOP_map_foreign_page then the affected emulator(s) must
> > specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN
> > address so that the references to the underlying MFN are removed.
> >
> > If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
> > IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
> > the BFN mapping will be swapped to point at a scratch MFN page and all BFN
> > references to the invalid MFN will be removed by Xen after the BFN mapping has
> > been updated to point at the scratch MFN page.
> 
> I don't understand why, for the 'swap' case, the emulator doesn't need to do an
> explicit unmap. You can think of 'noswap' (page-A to invalid) as a special
> case of 'swap' (page-A to scratch page), since they both move
> away from the page-A reference. If there is a reason the emulator needs
> to do some cleanup internally before dropping the reference, does
> 'swap_mfn' break that situation then?
> 
> >
> > The rationale for swapping the BFN mapping to point at scratch pages is to
> > enable guest domains to balloon quickly without requiring hypercall(s) from
> > emulators.
> >
> > Not all BFN mappings can be swapped without potentially causing problems for the
> > hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
> > allow per BFN control of Xen ballooning behaviour.
> 
> Who will judge whether a BFN mapping can be swapped then?
> 
> [...]
> > Xen PV IOMMU hypercall interface
> > --------------------------------
> > A two argument hypercall interface (do_iommu_op).
> >
> >     ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
> >
> > First argument, guest handle pointer to array of `struct pv_iommu_op`
> >
> > Second argument, unsigned integer count of `struct pv_iommu_op` elements in array.
> >
> > Definition of `struct pv_iommu_op`:
> >
> >     struct pv_iommu_op {
> >
> >         uint16_t subop_id;
> >         uint16_t flags;
> >         int32_t status;
> >
> >         union {
> >             struct {
> >                 uint64_t bfn;
> >                 uint64_t gfn;
> >             } map_page;
> >
> >             struct {
> >                 uint64_t bfn;
> >             } unmap_page;
> >
> >             struct {
> >                 uint64_t bfn;
> >                 uint64_t gfn;
> >                 uint16_t domid;
> >                 ioservid_t ioserver;
> >             } map_foreign_page;
> >
> >             struct {
> >                 uint64_t bfn;
> >                 uint64_t gfn;
> >                 uint16_t domid;
> >                 ioservid_t ioserver;
> >             } lookup_foreign_page;
> >
> >             struct {
> >                 uint64_t bfn;
> >                 ioservid_t ioserver;
> >             } unmap_foreign_page;
> >         } u;
> >     };
> 
> Do we really need the ioserver ID here? Could it be as simple
> as looping over all ioreq servers with INVALIDATE notifications?
> 
> 
> [...]
> >
> > IOMMUOP_map_page
> > ----------------------
> > This subop uses `struct map_page` part of the `struct pv_iommu_op`.
> >
> > If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> > allowed to map all GFNs except for Xen owned MFNs else the hardware
> > domain will only be allowed to map GFNs which it owns.
> 
> "map all GFNs" -> "map all MFNs" since you use "except for Xen owned MFNs"
> later. Since you have a capability called IOMMU_QUERY_map_all_mfns, should
> you add that condition to the description above?
> 
> >
> > If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> > allowed to map all GFNs without taking a reference to the MFN backing the GFN
> > by setting the IOMMU_MAP_OP_no_ref_cnt flag.
> 
> Could you elaborate on when no_ref_cnt is required?
> 
> [...]
> >
> > IOMMUOP_unmap_page
> > ------------------
> > This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
> >
> > The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
> > are detailed below:
> >
> > --------------------------------------------------------------------
> > Field          Purpose
> > -----          -----------------------------------------------------
> > `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
> >
> > `flags`        [in] Flags for signalling page order of unmap operation
> >
> > `status`       [out] Mapping status of this unmap operation, 0 indicates success
> > --------------------------------------------------------------------
> >
> > Defined bits for flags field:
> >
> > Name                        Bit                Definition
> > ----                       -----      ----------------------------------
> > IOMMU_UNMAP_OP_remove_m2b    0        Wildcard M2B mapping removed for
> >                                       lookup_foreign_page use
> 
> Is it explicitly required? Should it be implicit as long as a valid M2B entry exists?
> 
> 
> [...]
> > IOMMUOP_map_foreign_page
> > ------------------------
> > This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
> >
> > It is not valid to use a domid representing the calling domain.
> 
> Then what's being used here to represent the calling domain?
> 
> >
> > The hypercall will only succeed if calling domain has sufficient privilege over
> > the specified domid.
> 
> How is this privilege check being done? Is there an existing mechanism, or
> something new to add?
> 
> >
> > The M2B mechanism is an MFN to (BFN,domid,ioserver) tuple.
> >
> > Each successful subop will add to the M2B if there was not an existing identical
> > M2B entry.
> >
> > Every new M2B entry will take a reference to the MFN backing the GFN.
> >
> > All the following conditions are required to be true for PV IOMMU map_foreign
> > subop to succeed:
> >
> > 1. IOMMU detected and supported by Xen
> > 2. The domain has IOMMU controlled hardware allocated to it
> > 3. The domain is the hardware_domain and the following Xen IOMMU options are
> >    NOT enabled: dom0-passthrough
> 
> 4. the domain has sufficient privilege over the specified domid;
> 
> [...]
> >
> > IOMMU_lookup_foreign_page
> > -------------------------
> > This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
> >
> > This subop looks up a BFN mapping for an ioserver + gfn + target domid
> > combination.
> >
> > The hypercall will only succeed if calling domain has sufficient privilege over
> > the specified domid.
> >
> > If a 1:1 mapping exists of BFN to MFN then a M2B entry is added and a
> > reference is taken to the underlying MFN. If an existing mapping is present
> 
> Then when will this very reference be dropped?
> 
> > then the BFN is returned and no additional references will be taken to the
> > underlying MFN.
> >
> > A 1:1 mapping will exist if there is no IOMMU support or if the PV hardware
> > domain was booted in dom0-relaxed mode or in dom0-passthrough mode.
> 
> What about the hardware domain using IOMMUOPs in the meantime? In that
> case, from your earlier description it is the hardware domain that manages the BFN
> address space, while here the 1:1 mapping is a hard assumption in the hypervisor,
> so the two may conflict. There needs to be a mechanism so that
> once Xen sees any explicit BFN passed from the hardware domain, the
> 1:1 mapping scheme is disabled.
> 
> >
> > If there is no IOMMU support then the MFN is returned in the BFN field (that is
> > the only valid bus address for the GFN + domid combination).
> >
> 
> [...]
> >
> > Linux kernel architecture
> > =========================
> >
> > The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
> > space into the IOMMU. It will map the PFNs to the IOMMU address space using
> > a 1:1 mapping, it does this by programming a BFN to GFN mapping which matches
> > the PFN to GFN mapping.
> >
> > The native SWIOTLB will be used to handle devices which cannot DMA to all of
> > the kernel's PFN address space.
> >
> > An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
> > subops which will allow the Linux kernel to centrally manage that domain's BFN
> > resource and ensure there are no unexpected conflicts.
> 
> One open question here. When the IOMMU is enabled, there is supposed to be an
> IOVA space created in the Linux kernel. How does this BFN space interact
> with that one?
> 
> Thanks
> Kevin
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel