* [RFC] Dom0 PV IOMMU control design (draft A)
@ 2014-04-11 17:28 Malcolm Crossley
2014-04-11 17:50 ` Konrad Rzeszutek Wilk
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-11 17:28 UTC (permalink / raw)
To: xen-devel
Hi,
Here is a design for allowing Dom0 PV guests to control the IOMMU. This
allows the Dom0 GPFN mapping to be programmed into the IOMMU and avoids
using the SWIOTLB bounce buffer technique in the Linux kernel (except for
legacy 32 bit DMA IO devices).
This feature provides two gains:
1. Improved performance for use cases which relied upon the bounce buffer,
   e.g. NIC cards using jumbo frames with linear buffers.
2. Prevention of SWIOTLB bounce buffer region exhaustion, which can cause
   unrecoverable Linux kernel driver errors.
A PDF version of the document is available here:
http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
The pandoc markdown format of the document is provided below to allow
for easier inline comments:
Introduction
============
Background
-------
Xen PV guests use a Guest Pseudo-physical Frame Number (GPFN) address space
which is decoupled from the host Machine Frame Number (MFN) address space.
PV guests which interact with hardware need to translate GPFN addresses to
MFN addresses because hardware uses the host address space only.

PV guest hardware drivers are only aware of the GPFN address space and
assume that if GPFN addresses are contiguous then the hardware addresses
will be contiguous as well. The decoupling between the GPFN and MFN address
spaces means GPFN and MFN addresses may not be contiguous across page
boundaries, and thus a buffer allocated in GPFN address space which spans a
page boundary may not be contiguous in MFN address space.
PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.
A bounce buffer region is a special part of the GPFN address space which has
been made to be contiguous in both the GPFN and MFN address spaces. When a
driver requests that a buffer spanning a page boundary be made available for
hardware to read, core operating system code copies the buffer into a
temporarily reserved part of the bounce buffer region and then returns the
MFN address of that reserved part back to the driver. The driver then
instructs the hardware to read the copy of the buffer in the bounce buffer.
Similarly, if the driver requests that a buffer be made available for
hardware to write to, a region of the bounce buffer is first reserved, and
after the hardware completes writing, the reserved region of the bounce
buffer is copied back to the originally allocated buffer.
The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk the fixed size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do
not tolerate this failure and so the kernel is forced to crash, as an
uncorrectable error has occurred.
Input/Output Memory Management Units (IOMMUs) allow an inbound address
mapping to be created from the I/O bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and can therefore create mappings of page
size granularity or larger.
Purpose
=======
Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
hardware devices that Domain 0 has access to. This enables Domain 0 to
program a bus address space mapping which matches its GPFN mapping. Once a
1-1 mapping of GPFN to bus address space is created, a bounce buffer region
is not required for the IO devices connected to the IOMMU.
Architecture
============
A three-argument hypercall interface (do_iommu_op), implementing two
hypercall subops, is proposed.
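
As a rough illustration, the shape of the interface in C would be along the
following lines; the entry point name, subop numbering and argument types
shown here are placeholders and are not fixed by this draft:

    /* Illustrative prototype only: names and subop numbers are placeholders. */
    #define IOMMUOP_map_page    1
    #define IOMMUOP_unmap_page  2

    struct iommu_map_op;    /* defined in the subop sections below */

    /* do_iommu_op(subop, pointer to array of ops, count of array elements) */
    long do_iommu_op(unsigned int subop,
                     struct iommu_map_op *ops,
                     unsigned int count);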
Design considerations for hypercall subops
-------------------------------------------
IOMMU map/unmap operations can be slow and can involve flushing the IOMMU
TLB to ensure the IO device uses the updated mappings.

The subops have been designed to take an array of operations and a count as
parameters. This allows hypercall continuations to be easily implemented
and allows batches of IOMMU operations to be submitted before flushing the
IOMMU TLB.
IOMMUOP_map_page
----------------
First argument, pointer to array of `struct iommu_map_op`
Second argument, integer count of `struct iommu_map_op` elements in array
This subop will attempt to IOMMU map each element in the `struct
iommu_map_op` array and record the mapping status back into the array
itself. If a mapping fault occurs then the hypercall will return with
-EFAULT.
This subop will inspect the MFN address being mapped in each iommu_map_op
to ensure it does not belong to the Xen hypervisor itself. If the MFN does
belong to the Xen hypervisor, the subop will return -EPERM in the status
field for that particular iommu_map_op.
The IOMMU TLB will only be flushed when the hypercall completes or a
hypercall continuation is created.
    struct iommu_map_op {
        uint64_t bfn;
        uint64_t mfn;
        uint32_t flags;
        int32_t status;
    };
------------------------------------------------------------------------------
Field       Purpose
----------  ------------------------------------------------------------------
`bfn`       [in] Bus address frame number to be mapped to the specified mfn
            below

`mfn`       [in] Machine address frame number

`flags`     [in] Flags for signalling type of IOMMU mapping to be created

`status`    [out] Mapping status of this map operation, 0 indicates success
------------------------------------------------------------------------------
Defined bits for flags field

------------------------------------------------------------------------
Name                       Bit     Definition
-------------------------  ------  ------------------------------------
IOMMU_MAP_OP_readable      0       Create readable IOMMU mapping

IOMMU_MAP_OP_writeable     1       Create writeable IOMMU mapping

Reserved for future use    2-31    n/a
------------------------------------------------------------------------
Additional error codes specific to this hypercall:

Error code  Reason
----------  ------------------------------------------------------------
EPERM       PV IOMMU mode not enabled or calling domain is not domain 0
------------------------------------------------------------------------
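
To make the batching model concrete, below is a minimal Dom0-side usage
sketch. The mask names assume the bit positions from the flags table, and
`issue_iommu_op()` and `pfn_to_mfn()` stand in for a hypercall wrapper and
the usual Linux PV P2M lookup respectively; none of these names are defined
by this draft:

    /* Illustrative sketch: map a small GPFN range 1:1 into bus address space. */
    #define IOMMU_MAP_OP_readable  (1u << 0)
    #define IOMMU_MAP_OP_writeable (1u << 1)
    #define MAP_BATCH 16

    static int map_gpfn_range(unsigned long gpfn, unsigned int nr)
    {
        struct iommu_map_op ops[MAP_BATCH];
        unsigned int i;

        if (nr > MAP_BATCH)
            return -1;

        for (i = 0; i < nr; i++) {
            ops[i].bfn    = gpfn + i;              /* BFN chosen to equal GPFN */
            ops[i].mfn    = pfn_to_mfn(gpfn + i);  /* guest P2M lookup */
            ops[i].flags  = IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable;
            ops[i].status = 0;
        }

        if (issue_iommu_op(IOMMUOP_map_page, ops, nr) < 0)
            return -1;                             /* e.g. -EFAULT or -EPERM */

        for (i = 0; i < nr; i++)
            if (ops[i].status)                     /* per-op failure */
                return -1;

        return 0;
    }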
IOMMUOP_unmap_page
----------------
First argument, pointer to array of `struct iommu_map_op`
Second argument, integer count of `struct iommu_map_op` elements in array
This subop will attempt to unmap each element in the `struct iommu_map_op`
array and record the unmapping status back into the array itself. If an
unmapping fault occurs then the hypercall will stop processing the array
and return with -EFAULT.
The IOMMU TLB will only be flushed when the hypercall completes or a
hypercall continuation is created.
    struct iommu_map_op {
        uint64_t bfn;
        uint64_t mfn;
        uint32_t flags;
        int32_t status;
    };
--------------------------------------------------------------------
Field       Purpose
----------  --------------------------------------------------------
`bfn`       [in] Bus address frame number to be unmapped

`mfn`       [in] This field is ignored for unmap subop

`flags`     [in] This field is ignored for unmap subop

`status`    [out] Mapping status of this unmap operation, 0 indicates
            success
--------------------------------------------------------------------
Additional error codes specific to this hypercall:

Error code  Reason
----------  ------------------------------------------------------------
EPERM       PV IOMMU mode not enabled or calling domain is not domain 0
------------------------------------------------------------------------
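
A matching unmap sketch, reusing the hypothetical `issue_iommu_op()` wrapper
and `MAP_BATCH` constant from the map example above; only `bfn` and `status`
carry meaning here:

    /* Illustrative sketch: tear down a range of BFN mappings. */
    static int unmap_bfn_range(unsigned long bfn, unsigned int nr)
    {
        struct iommu_map_op ops[MAP_BATCH];
        unsigned int i;

        if (nr > MAP_BATCH)
            return -1;

        for (i = 0; i < nr; i++) {
            ops[i].bfn    = bfn + i;
            ops[i].mfn    = 0;          /* ignored for unmap */
            ops[i].flags  = 0;          /* ignored for unmap */
            ops[i].status = 0;
        }

        if (issue_iommu_op(IOMMUOP_unmap_page, ops, nr) < 0)
            return -1;

        for (i = 0; i < nr; i++)
            if (ops[i].status)
                return -1;

        return 0;
    }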
Conditions for which PV IOMMU hypercalls succeed
------------------------------------------------
All the following conditions are required to be true for PV IOMMU hypercalls
to succeed:
1. IOMMU detected and supported by Xen
2. The following Xen IOMMU options are NOT enabled: dom0-passthrough,
dom0-strict
3. Domain 0 is making the hypercall
Security Implications of allowing Domain 0 IOMMU control
========================================================
Xen currently allows IO devices attached to Domain 0 to have direct access
to all of the MFN address space (except Xen hypervisor memory regions),
provided the Xen IOMMU option dom0-strict is not enabled.

The PV IOMMU feature provides the same level of access to the MFN address
space, and the feature is not enabled when the Xen IOMMU option dom0-strict
is enabled. Therefore security is not affected by the PV IOMMU feature.
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-11 17:28 [RFC] Dom0 PV IOMMU control design (draft A) Malcolm Crossley
@ 2014-04-11 17:50 ` Konrad Rzeszutek Wilk
2014-04-14 12:12 ` Malcolm Crossley
2014-04-14 11:52 ` David Vrabel
` (2 subsequent siblings)
3 siblings, 1 reply; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-11 17:50 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> Hi,
>
> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> This allows for the Dom0 GPFN mapping to be programmed into the
> IOMMU and avoid using the SWIOTLB bounce buffer technique in the
> Linux kernel (except for legacy 32 bit DMA IO devices)
>
> This feature provides two gains:
> 1. Improved performance for use cases which relied upon the bounce
> buffer e.g. NIC cards using jumbo frames with linear buffers.
> 2. Prevent SWIOTLB bounce buffer region exhaustion which can cause
> unrecoverable Linux kernel driver errors.
>
> A PDF version of the document is available here:
>
> http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
>
> The pandoc markdown format of the document is provided below to
> allow for easier inline comments:
>
> Introduction
> ============
>
> Background
> -------
>
> Xen PV guests use a Guest Pseudo-physical Frame Number(GPFN) address space
> which is decoupled from the host Machine Frame Number(MFN) address
> space. PV
> guests which interact with hardware need to translate GPFN addresses to MFN
> address because hardware uses the host address space only.
> PV guest hardware drivers are only aware of the GPFN address space only and
> assume that if GPFN addresses are contiguous then the hardware
> addresses would
> be contiguous as well. The decoupling between GPFN and MFN address
> spaces means
> GPFN and MFN addresses may not be contiguous across page boundaries
> and thus a
> buffer allocated in GPFN address space which spans a page boundary
> may not be
> contiguous in MFN address space.
>
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
>
> A bounce buffer region is a special part of the GPFN address space which has
> been made to be contiguous in both GPFN and MFN address spaces. When
> a driver
> requests a buffer which spans a page boundary be made available for
> hardware
> to read then core operating system code copies the buffer into a temporarily
> reserved part of the bounce buffer region and then returns the MFN
> address of
> the reserved part of the bounce buffer region back to the driver itself. The
> driver then instructs the hardware to read the copy of the buffer in the
> bounce buffer. Similarly if the driver requests a buffer is made available
> for hardware to write to then first a region of the bounce buffer is
> reserved
> and then after the hardware completes writing then the reserved region of
> bounce buffer is copied to the originally allocated buffer.
>
> The overheard of memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk the fixed size
> bounce buffer region will become exhausted and it will not be possible to
> return an hardware address back to the driver. The Linux kernel
> drivers do not
> tolerate this failure and so the kernel is forced to crash the kernel, as an
> uncorrectable error has occurred.
>
> Input/Output Memory Management Units (IOMMU) allow for an inbound address
> mapping to be created from the I/O Bus address space (typically PCI) to
> the machine frame number address space. IOMMU's typically use a page table
> mechanism to manage the mappings and therefore can create mappings
> of page size
> granularity or larger.
>
> Purpose
> =======
>
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0
> to program
> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
> mapping of GPFN to bus address space is created then a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
>
>
> Architecture
> ============
>
> A three argument hypercall interface (do_iommu_op), implementing two
> hypercall
> subops.
>
> Design considerations for hypercall subops
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the
> IOMMU TLB
> to ensure the IO device uses the updated mappings.
>
> The subops have been designed to take an array of operations and a count as
> parameters. This allows for easily implemented hypercall
> continuations to be
> used and allows for batches of IOMMU operations to be submitted
> before flushing
> the IOMMU TLB.
>
>
>
> IOMMUOP_map_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Could this be 'unsigned integer' count?
Is there a limit? Can I do 31415 of them? Can I do it for the whole
memory space of the guest?
>
> This subop will attempt to IOMMU map each element in the `struct
> iommu_map_op`
> array and record the mapping status back into the array itself. If
> an mapping
> fault occurs then the hypercall will return with -EFAULT.
>
> This subop will inspect the MFN address being mapped in each
> iommu_map_op to
> ensure it does not belong to the Xen hypervisor itself. If the MFN
> does belong
> to the Xen hypervisor the subop will return -EPERM in the status
> field for that
> particular iommu_map_op.
Is it OK if the MFN belongs to another guest?
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall
> continuation is created.
>
> struct iommu_map_op {
> uint64_t bfn;
bus_frame ?
> uint64_t mfn;
> uint32_t flags;
> int32_t status;
> };
>
> ------------------------------------------------------------------------------
> Field Purpose
> ----- ---------------------------------------------------------------
> `bfn` [in] Bus address frame number to mapped to specified
> mfn below
Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
How would dom0 know the bus address? That usually is something only the
IOMMU knows.
>
> `mfn` [in] Machine address frame number
>
We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
that in the PCIe devices right?
> `flags` [in] Flags for signalling type of IOMMU mapping to be created
>
> `status` [out] Mapping status of this map operation, 0
> indicates success
> ------------------------------------------------------------------------------
>
>
> Defined bits for flags field
> ------------------------------------------------------------------------
> Name Bit Definition
> ---- ----- ----------------------------------
> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
And is it OK to use both?
> Reserved for future use 2-31 n/a
> ------------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
And -EFAULT
and what about success? Do you return 0 or the number of ops that were
successful?
> ------------------------------------------------------------------------
>
> IOMMUOP_unmap_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Um, 'unsigned integer' count?
>
> This subop will attempt to unmap each element in the `struct
> iommu_map_op` array
> and record the mapping status back into the array itself. If an
> unmapping fault
> occurs then the hypercall stop processing the array and return with
> an EFAULT;
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall
> continuation is created.
>
> struct iommu_map_op {
> uint64_t bfn;
> uint64_t mfn;
> uint32_t flags;
> int32_t status;
> };
>
> --------------------------------------------------------------------
> Field Purpose
> ----- -----------------------------------------------------
> `bfn` [in] Bus address frame number to be unmapped
I presume this is gathered from the 'map' call?
>
> `mfn` [in] This field is ignored for unmap subop
>
> `flags` [in] This field is ignored for unmap subop
>
> `status` [out] Mapping status of this unmap operation, 0
> indicates success
> --------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
EFAULT too
> ------------------------------------------------------------------------
>
>
> Conditions for which PV IOMMU hypercalls succeed
> ------------------------------------------------
> All the following conditions are required to be true for PV IOMMU hypercalls
> to succeed:
>
> 1. IOMMU detected and supported by Xen
> 2. The following Xen IOMMU options are NOT enabled:
> dom0-passthrough, dom0-strict
> 3. Domain 0 is making the hypercall
>
>
> Security Implications of allowing Domain 0 IOMMU control
> ========================================================
>
> Xen currently allows IO devices attached to Domain 0 to have direct
> access to
> the all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
>
> The PV IOMMU feature provides the same level of access to MFN address space
> and the feature is not enabled when the Xen IOMMU option dom0-strict is
> enabled. Therefore security is not affected by the PV IOMMU feature.
>
>
>
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-11 17:28 [RFC] Dom0 PV IOMMU control design (draft A) Malcolm Crossley
2014-04-11 17:50 ` Konrad Rzeszutek Wilk
@ 2014-04-14 11:52 ` David Vrabel
2014-04-14 13:47 ` Jan Beulich
2014-04-16 14:13 ` Zhang, Xiantao
3 siblings, 0 replies; 18+ messages in thread
From: David Vrabel @ 2014-04-14 11:52 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
On 11/04/14 18:28, Malcolm Crossley wrote:
> Purpose
> =======
>
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0 to
> program
> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
> mapping of GPFN to bus address space is created then a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
I think this needs to be expanded on further. Suggest an additional
section on the design of the Linux/guest side.
> IOMMUOP_map_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Perhaps a C function prototype here would be clearer?
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall
> continuation is created.
The IOMMU TLB flush should be when all the ops are complete. There's no
need to flush when doing a continuation.
> Field Purpose
> ----- ---------------------------------------------------------------
> `bfn` [in] Bus address frame number to mapped to specified mfn
> below
>
> `mfn` [in] Machine address frame number
>
> `flags` [in] Flags for signalling type of IOMMU mapping to be
> created
>
> `status` [out] Mapping status of this map operation, 0 indicates
> success
Need to define the failure codes. Perhaps similar to the grant table op
status codes?
> ------------------------------------------------------------------------------
>
>
>
> Defined bits for flags field
> ------------------------------------------------------------------------
> Name Bit Definition
> ---- ----- ----------------------------------
> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
> Reserved for future use 2-31 n/a
> ------------------------------------------------------------------------
Should state that it is not possible to create write-only mappings
(since not all hardware supports this).
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
ENOSYS for PV IOMMU not enabled perhaps? And similarly for unmap.
David
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-11 17:50 ` Konrad Rzeszutek Wilk
@ 2014-04-14 12:12 ` Malcolm Crossley
2014-04-14 12:51 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-14 12:12 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel
On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
>> Hi,
>>
>> Here is a design for allowing Dom0 PV guests to control the IOMMU.
>> This allows for the Dom0 GPFN mapping to be programmed into the
>> IOMMU and avoid using the SWIOTLB bounce buffer technique in the
>> Linux kernel (except for legacy 32 bit DMA IO devices)
>>
>> ...
>>
>> Design considerations for hypercall subops
>> -------------------------------------------
>> IOMMU map/unmap operations can be slow and can involve flushing the
>> IOMMU TLB
>> to ensure the IO device uses the updated mappings.
>>
>> The subops have been designed to take an array of operations and a count as
>> parameters. This allows for easily implemented hypercall
>> continuations to be
>> used and allows for batches of IOMMU operations to be submitted
>> before flushing
>> the IOMMU TLB.
>>
>>
>>
>> IOMMUOP_map_page
>> ----------------
>> First argument, pointer to array of `struct iommu_map_op`
>> Second argument, integer count of `struct iommu_map_op` elements in array
> Could this be 'unsigned integer' count?
Yes will change for draft B
>
> Is there a limit? Can I do 31415 of them? Can I do it for the whole
> memory space of the guest?
There is no current limit; the hypercall will be implemented with
continuations to prevent denial of service attacks.
>> This subop will attempt to IOMMU map each element in the `struct
>> iommu_map_op`
>> array and record the mapping status back into the array itself. If
>> an mapping
>> fault occurs then the hypercall will return with -EFAULT.
>> This subop will inspect the MFN address being mapped in each
>> iommu_map_op to
>> ensure it does not belong to the Xen hypervisor itself. If the MFN
>> does belong
>> to the Xen hypervisor the subop will return -EPERM in the status
>> field for that
>> particular iommu_map_op.
> Is it OK if the MFN belongs to another guest?
>
It is OK for the MFN to belong to another guest because only Dom0 is
performing the mapping. This is to allow grant mapped pages to have a
Dom0 BFN mapping.
>> The IOMMU TLB will only be flushed when the hypercall completes or a
>> hypercall
>> continuation is created.
>>
>> struct iommu_map_op {
>> uint64_t bfn;
> bus_frame ?
Yes, or you could say, Bus Address with 4k page size granularity.
>
>> uint64_t mfn;
>> uint32_t flags;
>> int32_t status;
>> };
>>
>> ------------------------------------------------------------------------------
>> Field Purpose
>> ----- ---------------------------------------------------------------
>> `bfn` [in] Bus address frame number to mapped to specified
>> mfn below
> Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
> How would dom0 know the bus address? That usually is something only the
> IOMMU knows.
The idea is that Domain0 sets up BFN to MFN mappings which are the same
as the GPFN to MFN mappings. This cannot be done from Xen's M2P mappings
because they are not up to date for PV guests.
>
>> `mfn` [in] Machine address frame number
>>
>
> We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
> that in the PCIe devices right?
Yes, a GPFN to MFN lookup will be required but this is simply consulting
the Guest P2M. BTW, we are programming the IOMMU, not the PCIe devices
themselves.
>
>
>> `flags` [in] Flags for signalling type of IOMMU mapping to be created
>>
>> `status` [out] Mapping status of this map operation, 0
>> indicates success
>> ------------------------------------------------------------------------------
>>
>>
>> Defined bits for flags field
>> ------------------------------------------------------------------------
>> Name Bit Definition
>> ---- ----- ----------------------------------
>> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
>> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
> And is it OK to use both?
Yes and typically both would be used. I'm just allowing for read only
mappings and write only mappings to be created.
>
>> Reserved for future use 2-31 n/a
>> ------------------------------------------------------------------------
>>
>> Additional error codes specific to this hypercall:
>>
>> Error code Reason
>> ---------- ------------------------------------------------------------
>> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
> And -EFAULT
I was considering that EFAULT would be a standard error code; do you want
it to be explicitly listed?
>
> and what about success? Do you return 0 or the number of ops that were
> successfull?
Return 0 if all ops were successful, otherwise return the number of
failed operations as a positive number. I leave it up to the caller to
determine which operations failed by iterating the input array. I'll
add this to draft B.
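In caller terms (wrapper and helper names purely hypothetical), that
convention would look roughly like:

    /* rc == 0: all ops succeeded; rc > 0: rc ops failed, and the per-op
     * status fields say which ones and why; rc < 0: whole call failed. */
    int rc = issue_iommu_op(IOMMUOP_map_page, ops, nr);
    if (rc > 0) {
        unsigned int i;

        for (i = 0; i < nr; i++)
            if (ops[i].status)
                handle_failed_map(&ops[i]);   /* hypothetical recovery path */
    }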
>
>> ------------------------------------------------------------------------
>>
>> IOMMUOP_unmap_page
>> ----------------
>> First argument, pointer to array of `struct iommu_map_op`
>> Second argument, integer count of `struct iommu_map_op` elements in array
> Um, 'unsigned integer' count?
Yes will change for draft B
>
>> This subop will attempt to unmap each element in the `struct
>> iommu_map_op` array
>> and record the mapping status back into the array itself. If an
>> unmapping fault
>> occurs then the hypercall stop processing the array and return with
>> an EFAULT;
>>
>> The IOMMU TLB will only be flushed when the hypercall completes or a
>> hypercall
>> continuation is created.
>>
>> struct iommu_map_op {
>> uint64_t bfn;
>> uint64_t mfn;
>> uint32_t flags;
>> int32_t status;
>> };
>>
>> --------------------------------------------------------------------
>> Field Purpose
>> ----- -----------------------------------------------------
>> `bfn` [in] Bus address frame number to be unmapped
> I presume this is gathered from the 'map' call?
It does not need to be gathered; Domain 0 is responsible for its own
BFN mappings and there is no auditing of the BFN addresses themselves.
>
>> `mfn` [in] This field is ignored for unmap subop
>>
>> `flags` [in] This field is ignored for unmap subop
>>
>> `status` [out] Mapping status of this unmap operation, 0
>> indicates success
>> --------------------------------------------------------------------
>>
>> Additional error codes specific to this hypercall:
>>
>> Error code Reason
>> ---------- ------------------------------------------------------------
>> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
> EFAULT too
I was considering that EFAULT would be a standard error code; do you want
it to be explicitly listed?
>
>> ------------------------------------------------------------------------
>>
>>
>> Conditions for which PV IOMMU hypercalls succeed
>> ------------------------------------------------
>> All the following conditions are required to be true for PV IOMMU hypercalls
>> to succeed:
>>
>> 1. IOMMU detected and supported by Xen
>> 2. The following Xen IOMMU options are NOT enabled:
>> dom0-passthrough, dom0-strict
>> 3. Domain 0 is making the hypercall
>>
>>
>> Security Implications of allowing Domain 0 IOMMU control
>> ========================================================
>>
>> Xen currently allows IO devices attached to Domain 0 to have direct
>> access to
>> the all of the MFN address space (except Xen hypervisor memory regions),
>> provided the Xen IOMMU option dom0-strict is not enabled.
>>
>> The PV IOMMU feature provides the same level of access to MFN address space
>> and the feature is not enabled when the Xen IOMMU option dom0-strict is
>> enabled. Therefore security is not affected by the PV IOMMU feature.
>>
Thanks for your comments.
>>
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 12:12 ` Malcolm Crossley
@ 2014-04-14 12:51 ` Konrad Rzeszutek Wilk
2014-04-14 15:03 ` Malcolm Crossley
0 siblings, 1 reply; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-14 12:51 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
> On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
> >On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> >>Hi,
> >>
> >>Here is a design for allowing Dom0 PV guests to control the IOMMU.
With the device driver domains I think you should also rename the
'dom0' to device driver or 'hardware domain' - as this functionality
should be possible within a PV guest with PCI passthrough, for example.
> >>This allows for the Dom0 GPFN mapping to be programmed into the
> >>IOMMU and avoid using the SWIOTLB bounce buffer technique in the
> >>Linux kernel (except for legacy 32 bit DMA IO devices)
> >>
> >>...
> >>
> >>Design considerations for hypercall subops
> >>-------------------------------------------
> >>IOMMU map/unmap operations can be slow and can involve flushing the
> >>IOMMU TLB
> >>to ensure the IO device uses the updated mappings.
> >>
> >>The subops have been designed to take an array of operations and a count as
> >>parameters. This allows for easily implemented hypercall
> >>continuations to be
> >>used and allows for batches of IOMMU operations to be submitted
> >>before flushing
> >>the IOMMU TLB.
> >>
> >>
> >>
> >>IOMMUOP_map_page
> >>----------------
> >>First argument, pointer to array of `struct iommu_map_op`
> >>Second argument, integer count of `struct iommu_map_op` elements in array
> >Could this be 'unsigned integer' count?
> Yes will change for draft B
> >
> >Is there a limit? Can I do 31415 of them? Can I do it for the whole
> >memory space of the guest?
> There is no current limit, the hypercall will be implemented with
> continuations to prevent denial of service attacks.
> >>This subop will attempt to IOMMU map each element in the `struct
> >>iommu_map_op`
> >>array and record the mapping status back into the array itself. If
> >>an mapping
> >>fault occurs then the hypercall will return with -EFAULT.
> >>This subop will inspect the MFN address being mapped in each
> >>iommu_map_op to
> >>ensure it does not belong to the Xen hypervisor itself. If the MFN
> >>does belong
> >>to the Xen hypervisor the subop will return -EPERM in the status
> >>field for that
> >>particular iommu_map_op.
> >Is it OK if the MFN belongs to another guest?
> >
> It is OK for the MFN to belong to another guest because only Dom0 is
> performing the mapping. This is to allow grant mapped pages to have
> a Dom0 BFN mapping.
That should be reflected in the spec. That the MFN have to pass the
grant check.
> >>The IOMMU TLB will only be flushed when the hypercall completes or a
> >>hypercall
> >>continuation is created.
> >>
> >> struct iommu_map_op {
> >> uint64_t bfn;
> >bus_frame ?
> Yes, or you could say, Bus Address with 4k page size granularity.
The 'bfn' sounds like a new word. I was hoping you could use something
that is more in line with the nomenclature in the I/O world. Such as
'bus addresses', 'bus physical frame', etc. The 'bfn' sounds just
off. The best I could come up with was 'bus_frame' but there has to
be something that sounds better?
> >
> >> uint64_t mfn;
> >> uint32_t flags;
> >> int32_t status;
> >> };
> >>
> >>------------------------------------------------------------------------------
> >>Field Purpose
> >>----- ---------------------------------------------------------------
> >>`bfn` [in] Bus address frame number to mapped to specified
> >>mfn below
> >Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
> >How would dom0 know the bus address? That usually is something only the
> >IOMMU knows.
> The idea is that Domain0 sets up BFN to MFN mappings which are the
> same as the GPFN to MFN mappings. This cannot be done from Xen's M2P
> mappings because they are not up to date for PV guests.
OK, but if GPFN == MFN, then this is not a bus frame number. You might
as well call it 'gmfn'?
> >
> >>`mfn` [in] Machine address frame number
> >>
> >
> >We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
> >that in the PCIe devices right?
> Yes, a GPFN to MFN lookup will be required but this is simply
> consulting the Guest P2M. BTW, we are programming the IOMMU not the
> PCIe device's themselves.
We do have to program some value in the PCIe device so that it can
do the DMA operation. Said value right now is the MFN << PAGE_SHIFT (aka
physical address). But as the spec says - the 'physical address' might
not be equal to the 'bus address'. Hence the value we would be programming
in might be very well different than MFN.
Or in other words - the IOMMU could be non-coherent and it would use
its own addresses for physical addresses. As in, different than
the physical ones.
> >
> >
> >>`flags` [in] Flags for signalling type of IOMMU mapping to be created
> >>
> >>`status` [out] Mapping status of this map operation, 0
> >>indicates success
> >>------------------------------------------------------------------------------
> >>
> >>
> >>Defined bits for flags field
> >>------------------------------------------------------------------------
> >>Name Bit Definition
> >>---- ----- ----------------------------------
> >>IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
> >>IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
> >And is it OK to use both?
> Yes and typically both would be used. I'm just allowing for read
> only mappings and write only mappings to be created.
You need to mention that in the spec that it is OK to combine them.
Or mention which ones are not OK to combine.
> >
> >>Reserved for future use 2-31 n/a
> >>------------------------------------------------------------------------
> >>
> >>Additional error codes specific to this hypercall:
> >>
> >>Error code Reason
> >>---------- ------------------------------------------------------------
> >>EPERM PV IOMMU mode not enabled or calling domain is not domain 0
> >And -EFAULT
> I was considering that EFAULT would be standard error code, do you
> want it to be explicitly listed?
Yes. This is a spec and if you don't spell it out folks might miss this.
> >
> >and what about success? Do you return 0 or the number of ops that were
> >successfull?
> Return 0 if all ops were successful, otherwise return the number of
> failed operations as positive number. I leave it up to the caller to
> determine which operations failed by iterating the input array. I'll
> will add this to draft B.
Thank you.
> >
> >>------------------------------------------------------------------------
> >>
> >>IOMMUOP_unmap_page
> >>----------------
> >>First argument, pointer to array of `struct iommu_map_op`
> >>Second argument, integer count of `struct iommu_map_op` elements in array
> >Um, 'unsigned integer' count?
> Yes will change for draft B
> >
> >>This subop will attempt to unmap each element in the `struct
> >>iommu_map_op` array
> >>and record the mapping status back into the array itself. If an
> >>unmapping fault
> >>occurs then the hypercall stop processing the array and return with
> >>an EFAULT;
> >>
> >>The IOMMU TLB will only be flushed when the hypercall completes or a
> >>hypercall
> >>continuation is created.
> >>
> >> struct iommu_map_op {
> >> uint64_t bfn;
> >> uint64_t mfn;
> >> uint32_t flags;
> >> int32_t status;
> >> };
> >>
> >>--------------------------------------------------------------------
> >>Field Purpose
> >>----- -----------------------------------------------------
> >>`bfn` [in] Bus address frame number to be unmapped
> >I presume this is gathered from the 'map' call?
> It does not need to be gathered, Domain 0 is responsible for it's
> own BFN mappings and there is no auditing of the BFN address
> themselves.
I can see this being the case when BFN == MFN. But if that is not the
case I cannot see how the domain would gather the BFNs - unless it
gets them from the 'map' call.
> >
> >>`mfn` [in] This field is ignored for unmap subop
> >>
> >>`flags` [in] This field is ignored for unmap subop
> >>
> >>`status` [out] Mapping status of this unmap operation, 0
> >>indicates success
> >>--------------------------------------------------------------------
> >>
> >>Additional error codes specific to this hypercall:
> >>
> >>Error code Reason
> >>---------- ------------------------------------------------------------
> >>EPERM PV IOMMU mode not enabled or calling domain is not domain 0
> >EFAULT too
> I was considering that EFAULT would be standard error code, do you
> want it to be explicitly listed?
Yes.
> >
> >>------------------------------------------------------------------------
> >>
> >>
> >>Conditions for which PV IOMMU hypercalls succeed
> >>------------------------------------------------
> >>All the following conditions are required to be true for PV IOMMU hypercalls
> >>to succeed:
> >>
> >>1. IOMMU detected and supported by Xen
> >>2. The following Xen IOMMU options are NOT enabled:
> >>dom0-passthrough, dom0-strict
> >>3. Domain 0 is making the hypercall
> >>
> >>
> >>Security Implications of allowing Domain 0 IOMMU control
> >>========================================================
> >>
> >>Xen currently allows IO devices attached to Domain 0 to have direct
> >>access to
> >>the all of the MFN address space (except Xen hypervisor memory regions),
> >>provided the Xen IOMMU option dom0-strict is not enabled.
> >>
> >>The PV IOMMU feature provides the same level of access to MFN address space
> >>and the feature is not enabled when the Xen IOMMU option dom0-strict is
> >>enabled. Therefore security is not affected by the PV IOMMU feature.
> >>
> Thanks for your comments.
> >>
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-11 17:28 [RFC] Dom0 PV IOMMU control design (draft A) Malcolm Crossley
2014-04-11 17:50 ` Konrad Rzeszutek Wilk
2014-04-14 11:52 ` David Vrabel
@ 2014-04-14 13:47 ` Jan Beulich
2014-04-14 15:48 ` Malcolm Crossley
2014-04-16 14:13 ` Zhang, Xiantao
3 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-04-14 13:47 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
>>> On 11.04.14 at 19:28, <malcolm.crossley@citrix.com> wrote:
> Purpose
> =======
>
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0 to
> program
> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
> mapping of GPFN to bus address space is created then a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
So the idea then is for the domain to
- create this 1:1 mapping once at boot, and update it as pages get
  ballooned in/out, or
- zap the entire IOMMU mapping at boot, and establish individual
  mappings as needed based on the driver's use of the DMA map ops?
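For illustration, the second option would roughly mean a per-mapping path
like the sketch below on the Linux side; the helper and constant names are
assumptions carried over from the earlier sketches, not something defined
by the draft:

    /* Sketch of option 2: no global 1:1 mapping; each DMA map request
     * creates an IOMMU entry on demand.  xen_pv_iommu_map_page() is a
     * hypothetical helper the Xen dma_map_ops callbacks would invoke. */
    static dma_addr_t xen_pv_iommu_map_page(struct page *page)
    {
        unsigned long gpfn = page_to_pfn(page);
        struct iommu_map_op op = {
            .bfn   = gpfn,                  /* keep BFN == GPFN */
            .mfn   = pfn_to_mfn(gpfn),
            .flags = IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable,
        };

        if (issue_iommu_op(IOMMUOP_map_page, &op, 1) || op.status)
            return DMA_ERROR_CODE;          /* caller falls back to SWIOTLB */

        return (dma_addr_t)gpfn << PAGE_SHIFT;
    }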
> IOMMUOP_unmap_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
>
> This subop will attempt to unmap each element in the `struct
> iommu_map_op` array
> and record the mapping status back into the array itself. If an
unmapping?
> unmapping fault
mapping?
> occurs then the hypercall stop processing the array and return with an
> EFAULT;
-EFAULT.
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall
> continuation is created.
>
> struct iommu_map_op {
> uint64_t bfn;
> uint64_t mfn;
> uint32_t flags;
> int32_t status;
> };
>
> --------------------------------------------------------------------
> Field Purpose
> ----- -----------------------------------------------------
> `bfn` [in] Bus address frame number to be unmapped
>
> `mfn` [in] This field is ignored for unmap subop
>
> `flags` [in] This field is ignored for unmap subop
>
> `status` [out] Mapping status of this unmap operation, 0 indicates success
Hmm, mfn and flags ignored would call for an abbreviated interface
structure. Or how about IOMMU_MAP_OP_{readable,writeable}
implying unmap, in which case for now you'd need only one op. And
with that I wonder whether this shouldn't simply be a new
PHYSDEVOP_iommu_map.
> Security Implications of allowing Domain 0 IOMMU control
> ========================================================
>
> Xen currently allows IO devices attached to Domain 0 to have direct
> access to
> the all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
>
> The PV IOMMU feature provides the same level of access to MFN address space
> and the feature is not enabled when the Xen IOMMU option dom0-strict is
> enabled. Therefore security is not affected by the PV IOMMU feature.
While I was about to reply with the same request to extend this to
driver domains, this last section pretty much excludes such an
extension. I guess that would benefit from revisiting.
Jan
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 12:51 ` Konrad Rzeszutek Wilk
@ 2014-04-14 15:03 ` Malcolm Crossley
2014-04-14 15:09 ` Konrad Rzeszutek Wilk
2014-04-14 15:48 ` Jan Beulich
0 siblings, 2 replies; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-14 15:03 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel
On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
>> On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
>>>> Hi,
>>>>
>>>> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> With the device driver domains I think you should also rename the
> 'dom0' to device driver or 'hardware domain' - as this functionality
> should be possible within an PV guest with PCI passthrough for example.
Currently Xen only allows Dom0 IOMMU access to all (except Xen) MFNs.
To not change the current security implications of the feature I would
prefer that 'hardware domain' support was added as a separate design.
>
>>>> This allows for the Dom0 GPFN mapping to be programmed into the
>>>> IOMMU and avoid using the SWIOTLB bounce buffer technique in the
>>>> Linux kernel (except for legacy 32 bit DMA IO devices)
>>>>
>>>> ...
>>>>
>>>> Design considerations for hypercall subops
>>>> -------------------------------------------
>>>> IOMMU map/unmap operations can be slow and can involve flushing the
>>>> IOMMU TLB
>>>> to ensure the IO device uses the updated mappings.
>>>>
>>>> The subops have been designed to take an array of operations and a count as
>>>> parameters. This allows for easily implemented hypercall
>>>> continuations to be
>>>> used and allows for batches of IOMMU operations to be submitted
>>>> before flushing
>>>> the IOMMU TLB.
>>>>
>>>>
>>>>
>>>> IOMMUOP_map_page
>>>> ----------------
>>>> First argument, pointer to array of `struct iommu_map_op`
>>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>> Could this be 'unsigned integer' count?
>> Yes will change for draft B
>>> Is there a limit? Can I do 31415 of them? Can I do it for the whole
>>> memory space of the guest?
>> There is no current limit, the hypercall will be implemented with
>> continuations to prevent denial of service attacks.
>>>> This subop will attempt to IOMMU map each element in the `struct
>>>> iommu_map_op`
>>>> array and record the mapping status back into the array itself. If
>>>> an mapping
>>>> fault occurs then the hypercall will return with -EFAULT.
>>>> This subop will inspect the MFN address being mapped in each
>>>> iommu_map_op to
>>>> ensure it does not belong to the Xen hypervisor itself. If the MFN
>>>> does belong
>>>> to the Xen hypervisor the subop will return -EPERM in the status
>>>> field for that
>>>> particular iommu_map_op.
>>> Is it OK if the MFN belongs to another guest?
>>>
>> It is OK for the MFN to belong to another guest because only Dom0 is
>> performing the mapping. This is to allow grant mapped pages to have
>> a Dom0 BFN mapping.
> That should be reflected in the spec. That the MFN have to pass the
> grant check.
Again, if the design is restricted to domain 0 only then there is no
need for a grant check, because previously domain 0 had a bus addressable
mapping for all MFNs.
>>>> The IOMMU TLB will only be flushed when the hypercall completes or a
>>>> hypercall
>>>> continuation is created.
>>>>
>>>> struct iommu_map_op {
>>>> uint64_t bfn;
>>> bus_frame ?
>> Yes, or you could say, Bus Address with 4k page size granularity.
>
> The 'bfn' sounds like a new word. I was hoping you could use something
> that is more in line with the nomenclature in the I/O world. Such as
> 'bus addresses', 'bus physical frame', etc. The 'bfn' sounds just
> off. The best I could come up with was 'bus_frame' but there has to
> be something that sounds better?
I agree that bfn is not ideal. Using bus_address is dangerous because
the mapping can only be done at 4k granularity (MFNs are 4k granularity
also).
"iofn" is an alternative but potentially misleading because IOMMUs may
not be able to translate all IO devices accessible to dom0
(particularly on ARM).
>
>>>> uint64_t mfn;
>>>> uint32_t flags;
>>>> int32_t status;
>>>> };
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Field Purpose
>>>> ----- ---------------------------------------------------------------
>>>> `bfn` [in] Bus address frame number to mapped to specified
>>>> mfn below
>>> Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
>>> How would dom0 know the bus address? That usually is something only the
>>> IOMMU knows.
>> The idea is that Domain0 sets up BFN to MFN mappings which are the
>> same as the GPFN to MFN mappings. This cannot be done from Xen's M2P
>> mappings because they are not up to date for PV guests.
> OK, but if GPFN == MFN, then this is not an bus frame number. You might
> as well call it 'gmfn' ?
The idea is not to force the particular mappings that domain 0 is
allowed to create; the point I was trying to make above is that
domain 0 is responsible for tracking the BFN mappings it has set up.
>>>> `mfn` [in] Machine address frame number
>>>>
>>> We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
>>> that in the PCIe devices right?
>> Yes, a GPFN to MFN lookup will be required but this is simply
>> consulting the Guest P2M. BTW, we are programming the IOMMU not the
>> PCIe device's themselves.
> We do have to program some value in the PCIe device so that it can
> do the DMA operation. Said value right now is the MFN << PAGE_SHIFT (aka
> physical address). But as the spec says - the 'physical address' might
> not be equal to the 'bus address'. Hence the value we would be programming
> in might be very well different than MFN.
Yes, but domain 0 will track the BFN to MFN mappings it has set up (if it
sets up BFN == GPFN then this is easy).
So domain 0 is responsible for programming the correct BFN address into
the PCIe device.
>
> Or in other words - the IOMMU could be non-coherent and it would use
> its own addresses for physical addresses. As in, different than
> the physical ones.
I don't understand what you are trying to say here? When you say
physical address do you mean MFN (memory based addresses) or BFN (bus
addresses from the device point of view)?
>>>
>>>> `flags` [in] Flags for signalling type of IOMMU mapping to be created
>>>>
>>>> `status` [out] Mapping status of this map operation, 0
>>>> indicates success
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>> Defined bits for flags field
>>>> ------------------------------------------------------------------------
>>>> Name Bit Definition
>>>> ---- ----- ----------------------------------
>>>> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
>>>> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
>>> And is it OK to use both?
>> Yes and typically both would be used. I'm just allowing for read
>> only mappings and write only mappings to be created.
> You need to mention that in the spec that it is OK to combine them.
> Or mention which ones are not OK to combine.
Different IOMMU implementations may differ in what types of mappings can
be set up. I would expect the guest to handle the error if a particular
type of mapping it requires can't be set up.
>>>> Reserved for future use 2-31 n/a
>>>> ------------------------------------------------------------------------
>>>>
>>>> Additional error codes specific to this hypercall:
>>>>
>>>> Error code Reason
>>>> ---------- ------------------------------------------------------------
>>>> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
>>> And -EFAULT
>> I was considering that EFAULT would be standard error code, do you
>> want it to be explicitly listed?
> Yes. This is a spec and if you don't spell it out folks might miss this.
>
>>> and what about success? Do you return 0 or the number of ops that were
>>> successfull?
>> Return 0 if all ops were successful, otherwise return the number of
>> failed operations as positive number. I leave it up to the caller to
>> determine which operations failed by iterating the input array. I'll
>> will add this to draft B.
> Thank you.
>>>> ------------------------------------------------------------------------
>>>>
>>>> IOMMUOP_unmap_page
>>>> ----------------
>>>> First argument, pointer to array of `struct iommu_map_op`
>>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>> Um, 'unsigned integer' count?
>> Yes will change for draft B
>>>> This subop will attempt to unmap each element in the `struct
>>>> iommu_map_op` array
>>>> and record the mapping status back into the array itself. If an
>>>> unmapping fault
>>>> occurs then the hypercall stop processing the array and return with
>>>> an EFAULT;
>>>>
>>>> The IOMMU TLB will only be flushed when the hypercall completes or a
>>>> hypercall
>>>> continuation is created.
>>>>
>>>> struct iommu_map_op {
>>>> uint64_t bfn;
>>>> uint64_t mfn;
>>>> uint32_t flags;
>>>> int32_t status;
>>>> };
>>>>
>>>> --------------------------------------------------------------------
>>>> Field Purpose
>>>> ----- -----------------------------------------------------
>>>> `bfn` [in] Bus address frame number to be unmapped
>>> I presume this is gathered from the 'map' call?
>> It does not need to be gathered, Domain 0 is responsible for it's
>> own BFN mappings and there is no auditing of the BFN address
>> themselves.
> I can see this be the case when BFN == MFN. But if that is not the
> case I cannot see how the domain would gather the BFNs - unless it
> gets it from the 'map' call.
Domain 0 issues the original PV IOMMU map hypercall to cause the BFN
mapping to be created, so it should be able to track what needs to be
unmapped.
>
>>>> `mfn` [in] This field is ignored for unmap subop
>>>>
>>>> `flags` [in] This field is ignored for unmap subop
>>>>
>>>> `status` [out] Mapping status of this unmap operation, 0
>>>> indicates success
>>>> --------------------------------------------------------------------
>>>>
>>>> Additional error codes specific to this hypercall:
>>>>
>>>> Error code Reason
>>>> ---------- ------------------------------------------------------------
>>>> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
>>> EFAULT too
>> I was considering that EFAULT would be standard error code, do you
>> want it to be explicitly listed?
> Yes.
>>
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 15:03 ` Malcolm Crossley
@ 2014-04-14 15:09 ` Konrad Rzeszutek Wilk
2014-04-14 15:48 ` Jan Beulich
1 sibling, 0 replies; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-14 15:09 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
On Mon, Apr 14, 2014 at 04:03:22PM +0100, Malcolm Crossley wrote:
> On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
> >On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
> >>On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
> >>>On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> >>>>Hi,
> >>>>
> >>>>Here is a design for allowing Dom0 PV guests to control the IOMMU.
> >With the device driver domains I think you should also rename the
> >'dom0' to device driver or 'hardware domain' - as this functionality
> >should be possible within an PV guest with PCI passthrough for example.
> Currently Xen only allows Dom0 IOMMU access to all (expect Xen)
> MFN's. To not change the current security implications of the
> feature I would prefer that 'hardware domain' support was added as a
> separate design.
> >
> >>>>This allows for the Dom0 GPFN mapping to be programmed into the
> >>>>IOMMU and avoid using the SWIOTLB bounce buffer technique in the
> >>>>Linux kernel (except for legacy 32 bit DMA IO devices)
> >>>>
> >>>>...
> >>>>
> >>>>Design considerations for hypercall subops
> >>>>-------------------------------------------
> >>>>IOMMU map/unmap operations can be slow and can involve flushing the
> >>>>IOMMU TLB
> >>>>to ensure the IO device uses the updated mappings.
> >>>>
> >>>>The subops have been designed to take an array of operations and a count as
> >>>>parameters. This allows for easily implemented hypercall
> >>>>continuations to be
> >>>>used and allows for batches of IOMMU operations to be submitted
> >>>>before flushing
> >>>>the IOMMU TLB.
> >>>>
> >>>>
> >>>>
> >>>>IOMMUOP_map_page
> >>>>----------------
> >>>>First argument, pointer to array of `struct iommu_map_op`
> >>>>Second argument, integer count of `struct iommu_map_op` elements in array
> >>>Could this be 'unsigned integer' count?
> >>Yes will change for draft B
> >>>Is there a limit? Can I do 31415 of them? Can I do it for the whole
> >>>memory space of the guest?
> >>There is no current limit, the hypercall will be implemented with
> >>continuations to prevent denial of service attacks.
> >>>>This subop will attempt to IOMMU map each element in the `struct
> >>>>iommu_map_op`
> >>>>array and record the mapping status back into the array itself. If
> >>>>an mapping
> >>>>fault occurs then the hypercall will return with -EFAULT.
> >>>>This subop will inspect the MFN address being mapped in each
> >>>>iommu_map_op to
> >>>>ensure it does not belong to the Xen hypervisor itself. If the MFN
> >>>>does belong
> >>>>to the Xen hypervisor the subop will return -EPERM in the status
> >>>>field for that
> >>>>particular iommu_map_op.
> >>>Is it OK if the MFN belongs to another guest?
> >>>
> >>It is OK for the MFN to belong to another guest because only Dom0 is
> >>performing the mapping. This is to allow grant mapped pages to have
> >>a Dom0 BFN mapping.
> >That should be reflected in the spec. That the MFN have to pass the
> >grant check.
> Again, if the design is restricted to domain 0 only then there is no
> need for a grant check because previously domain 0 had a bus
> addressable mapping for all MFN's.
> >>>>The IOMMU TLB will only be flushed when the hypercall completes or a
> >>>>hypercall
> >>>>continuation is created.
> >>>>
> >>>> struct iommu_map_op {
> >>>> uint64_t bfn;
> >>>bus_frame ?
> >>Yes, or you could say, Bus Address with 4k page size granularity.
> >
> >The 'bfn' sounds like a new word. I was hoping you could use something
> >that is more in line with the nomenclature in the I/O world. Such as
> >'bus addresses', 'bus physical frame', etc. The 'bfn' sounds just
> >off. The best I could come up with was 'bus_frame' but there has to
> >be something that sounds better?
> I agree that bfn is not ideal. Using bus_address is dangerous
> because the mapping can only be done at 4k granularity (MFN's are 4k
> granularity also).
> "iofn" is an alternative but potentially misleading because IOMMU's
> may not be able to translate all IO device's accessible to dom0
> (particularly on ARM).
Who does the translation for the rest of those I/O devices? Other IOMMUs?
> >
> >>>> uint64_t mfn;
> >>>> uint32_t flags;
> >>>> int32_t status;
> >>>> };
> >>>>
> >>>>------------------------------------------------------------------------------
> >>>>Field Purpose
> >>>>----- ---------------------------------------------------------------
> >>>>`bfn` [in] Bus address frame number to mapped to specified
> >>>>mfn below
> >>>Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
> >>>How would dom0 know the bus address? That usually is something only the
> >>>IOMMU knows.
> >>The idea is that Domain0 sets up BFN to MFN mappings which are the
> >>same as the GPFN to MFN mappings. This cannot be done from Xen's M2P
> >>mappings because they are not up to date for PV guests.
> >OK, but if GPFN == MFN, then this is not an bus frame number. You might
> >as well call it 'gmfn' ?
> The idea is not to force the particular mappings that domain 0 is
> allowed to create but the point I was trying to make above is that
> domain 0 is responsible for tracking the bfn mappings it has setup.
OK, just mention that in the spec :-)
> >>>>`mfn` [in] Machine address frame number
> >>>>
> >>>We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
> >>>that in the PCIe devices right?
> >>Yes, a GPFN to MFN lookup will be required but this is simply
> >>consulting the Guest P2M. BTW, we are programming the IOMMU not the
> >>PCIe device's themselves.
> >We do have to program some value in the PCIe device so that it can
> >do the DMA operation. Said value right now is the MFN << PAGE_SHIFT (aka
> >physical address). But as the spec says - the 'physical address' might
> >not be equal to the 'bus address'. Hence the value we would be programming
> >in might be very well different than MFN.
> Yes but domain 0 will track the BFN to MFN mappings it has setup (
> if it setup BFN= GPFN then this is easy).
> So domain 0 is responsible to programming the correct BFN address
> into the PCIe device.
> >
> >Or in other words - the IOMMU could be non-coherent and it would use
> >its own addresses for physical addresses. As in, different than
> >the physical ones.
> I don't understand what you are trying to say here? When you say
> physical address do you mean MFN (memory based addresses) or BFN
> (bus addresses from the device point of view)?
physical address = CPU viewpoint, aka MFN.
bus addresses = IOMMU or I/O viewpoint, aka BFN.
And BFN != MFN.
> >>>
> >>>>`flags` [in] Flags for signalling type of IOMMU mapping to be created
> >>>>
> >>>>`status` [out] Mapping status of this map operation, 0
> >>>>indicates success
> >>>>------------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>Defined bits for flags field
> >>>>------------------------------------------------------------------------
> >>>>Name Bit Definition
> >>>>---- ----- ----------------------------------
> >>>>IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
> >>>>IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
> >>>And is it OK to use both?
> >>Yes and typically both would be used. I'm just allowing for read
> >>only mappings and write only mappings to be created.
> >You need to mention that in the spec that it is OK to combine them.
> >Or mention which ones are not OK to combine.
> Different IOMMU implementations may change what type of mappings can
> be setup. I would expect the guest to handle the error if particular
> type of mapping can't be setup which the guest requires.
OK, that needs to be mentioned in the spec too.
> >>>>Reserved for future use 2-31 n/a
> >>>>------------------------------------------------------------------------
> >>>>
> >>>>Additional error codes specific to this hypercall:
> >>>>
> >>>>Error code Reason
> >>>>---------- ------------------------------------------------------------
> >>>>EPERM PV IOMMU mode not enabled or calling domain is not domain 0
> >>>And -EFAULT
> >>I was considering that EFAULT would be standard error code, do you
> >>want it to be explicitly listed?
> >Yes. This is a spec and if you don't spell it out folks might miss this.
> >
> >>>and what about success? Do you return 0 or the number of ops that were
> >>>successfull?
> >>Return 0 if all ops were successful, otherwise return the number of
> >>failed operations as positive number. I leave it up to the caller to
> >>determine which operations failed by iterating the input array. I'll
> >>will add this to draft B.
> >Thank you.
> >>>>------------------------------------------------------------------------
> >>>>
> >>>>IOMMUOP_unmap_page
> >>>>----------------
> >>>>First argument, pointer to array of `struct iommu_map_op`
> >>>>Second argument, integer count of `struct iommu_map_op` elements in array
> >>>Um, 'unsigned integer' count?
> >>Yes will change for draft B
> >>>>This subop will attempt to unmap each element in the `struct
> >>>>iommu_map_op` array
> >>>>and record the mapping status back into the array itself. If an
> >>>>unmapping fault
> >>>>occurs then the hypercall stop processing the array and return with
> >>>>an EFAULT;
> >>>>
> >>>>The IOMMU TLB will only be flushed when the hypercall completes or a
> >>>>hypercall
> >>>>continuation is created.
> >>>>
> >>>> struct iommu_map_op {
> >>>> uint64_t bfn;
> >>>> uint64_t mfn;
> >>>> uint32_t flags;
> >>>> int32_t status;
> >>>> };
> >>>>
> >>>>--------------------------------------------------------------------
> >>>>Field Purpose
> >>>>----- -----------------------------------------------------
> >>>>`bfn` [in] Bus address frame number to be unmapped
> >>>I presume this is gathered from the 'map' call?
> >>It does not need to be gathered, Domain 0 is responsible for it's
> >>own BFN mappings and there is no auditing of the BFN address
> >>themselves.
> >I can see this be the case when BFN == MFN. But if that is not the
> >case I cannot see how the domain would gather the BFNs - unless it
> >gets it from the 'map' call.
> Domain 0 issue's the original PV IOMMU map hypercall to cause the
> BFN mapping to be created so it should be able to track what needs
> to be unmapped.
Earlier you stated "It does not need to be gathered", but it does -
via the 'map' call, as you confirmed here. Please just
note in the spec that these values should be used from prior
'map' calls.
> >
> >>>>`mfn` [in] This field is ignored for unmap subop
> >>>>
> >>>>`flags` [in] This field is ignored for unmap subop
> >>>>
> >>>>`status` [out] Mapping status of this unmap operation, 0
> >>>>indicates success
> >>>>--------------------------------------------------------------------
> >>>>
> >>>>Additional error codes specific to this hypercall:
> >>>>
> >>>>Error code Reason
> >>>>---------- ------------------------------------------------------------
> >>>>EPERM PV IOMMU mode not enabled or calling domain is not domain 0
> >>>EFAULT too
> >>I was considering that EFAULT would be standard error code, do you
> >>want it to be explicitly listed?
> >Yes.
> >>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 13:47 ` Jan Beulich
@ 2014-04-14 15:48 ` Malcolm Crossley
2014-04-14 16:00 ` Jan Beulich
2014-05-01 11:56 ` Tim Deegan
0 siblings, 2 replies; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-14 15:48 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
On 14/04/14 14:47, Jan Beulich wrote:
>>>> On 11.04.14 at 19:28, <malcolm.crossley@citrix.com> wrote:
>> Purpose
>> =======
>>
>> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
>> hardware devices that Domain 0 has access to. This enables Domain 0 to
>> program
>> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
>> mapping of GPFN to bus address space is created then a bounce buffer
>> region is not required for the IO devices connected to the IOMMU.
> So the idea then is for the domain to
> - create this 1:1 mapping once at boot, and update it as pages get
> ballooned int/out, or
> - zap the entire IOMMU mapping at boot, and establish individual
> mappings as needed based on the driver's use of the DMA map ops?
>
Yes, both of those are the general idea; I will document this better in
draft B as per David's suggestion.
The first strategy has better performance (fewer IOMMU updates) but lower
security against bad devices; the second strategy has better security but
higher IOMMU mapping overhead.
The current idea for maximum performance is to create a 1:1 mapping for
dom0 GPFN's and an offset (offset by max dom0 GPFN) 1:1 mapping for the
whole of the MFN address space which will be used for grant mapped
pages. This should result in IOMMU updates only for ballooned pages.
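As a rough sketch, the first strategy could be driven from dom0 along the
lines below. HYPERVISOR_iommu_op() is an assumed hypercall wrapper, the
flag masks are derived from the bit numbers in the flags table, and
pfn_to_mfn() stands in for a dom0 p2m lookup; none of these names are
fixed by this draft.

    /* Sketch only: batch-build a BFN == GPFN identity mapping for dom0's
     * own pages using the IOMMUOP_map_page subop, so the IOTLB is only
     * flushed at hypercall completion or continuation. */
    #define IOMMU_MAP_OP_readable  (1u << 0)   /* bit 0 in the flags table */
    #define IOMMU_MAP_OP_writeable (1u << 1)   /* bit 1 in the flags table */
    #define MAP_BATCH 256

    static int iommu_identity_map(unsigned long nr_gpfns)
    {
        struct iommu_map_op ops[MAP_BATCH];
        unsigned long gpfn = 0;
        int i, rc;

        while (gpfn < nr_gpfns) {
            for (i = 0; i < MAP_BATCH && gpfn < nr_gpfns; i++, gpfn++) {
                ops[i].bfn    = gpfn;               /* BFN == GPFN */
                ops[i].mfn    = pfn_to_mfn(gpfn);   /* dom0 p2m lookup */
                ops[i].flags  = IOMMU_MAP_OP_readable |
                                IOMMU_MAP_OP_writeable;
                ops[i].status = 0;
            }
            rc = HYPERVISOR_iommu_op(IOMMUOP_map_page, ops, i);
            if (rc)
                return rc;   /* inspect ops[].status for per-op errors */
        }
        /* The grant-map region would be built the same way, mapping
         * BFN = max dom0 GPFN + MFN for every MFN in the host. */
        return 0;
    }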
>> IOMMUOP_unmap_page
>> ----------------
>> First argument, pointer to array of `struct iommu_map_op`
>> Second argument, integer count of `struct iommu_map_op` elements in array
>>
>> This subop will attempt to unmap each element in the `struct
>> iommu_map_op` array
>> and record the mapping status back into the array itself. If an
> unmapping?
Agreed
>
>> unmapping fault
> mapping?
This is an unmap hypercall so I believe it would be an unmapping fault
reported.
>
>> occurs then the hypercall stop processing the array and return with an
>> EFAULT;
> -EFAULT.
Agreed
>> The IOMMU TLB will only be flushed when the hypercall completes or a
>> hypercall
>> continuation is created.
>>
>> struct iommu_map_op {
>> uint64_t bfn;
>> uint64_t mfn;
>> uint32_t flags;
>> int32_t status;
>> };
>>
>> --------------------------------------------------------------------
>> Field Purpose
>> ----- -----------------------------------------------------
>> `bfn` [in] Bus address frame number to be unmapped
>>
>> `mfn` [in] This field is ignored for unmap subop
>>
>> `flags` [in] This field is ignored for unmap subop
>>
>> `status` [out] Mapping status of this unmap operation, 0 indicates success
> Hmm, bfn and flags ignored would call for an abbreviated interface
> structure. Or how about IOMMU_MAP_OP_{readable,writeable}
> implying unmap, in which case for now you'd need only one op. And
> with that I wonder whether this shouldn't simply be a new
> PHYSDEVOP_iommu_map.
I initially tried to add it to the PHYSDEV hypercall as a subop but when
adding batch capability it seemed cleaner to use a 3 argument hypercall
to allow for the count to be passed in as argument rather than as a
field in a struct.
We could have a single op and define a new flag to indicate we want an
unmap to be performed.
>
>> Security Implications of allowing Domain 0 IOMMU control
>> ========================================================
>>
>> Xen currently allows IO devices attached to Domain 0 to have direct
>> access to
>> the all of the MFN address space (except Xen hypervisor memory regions),
>> provided the Xen IOMMU option dom0-strict is not enabled.
>>
>> The PV IOMMU feature provides the same level of access to MFN address space
>> and the feature is not enabled when the Xen IOMMU option dom0-strict is
>> enabled. Therefore security is not affected by the PV IOMMU feature.
> While I was about to reply with the same request to extend this to
> driver domains, this last section pretty much excludes such an
> extension. I guess that would benefit from revisiting.
I agree supporting driver domains would not be straightforward with
regards to security so I would suggest that driver domain support could
be added later as a separate design.
>
> Jan
Thank you for your review comments.
Malcolm
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 15:03 ` Malcolm Crossley
2014-04-14 15:09 ` Konrad Rzeszutek Wilk
@ 2014-04-14 15:48 ` Jan Beulich
2014-04-14 16:38 ` Malcolm Crossley
1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-04-14 15:48 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
>>> On 14.04.14 at 17:03, <malcolm.crossley@citrix.com> wrote:
> On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
>> On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
>>> On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
>>>> On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
>>>>> Hi,
>>>>>
>>>>> Here is a design for allowing Dom0 PV guests to control the IOMMU.
>> With the device driver domains I think you should also rename the
>> 'dom0' to device driver or 'hardware domain' - as this functionality
>> should be possible within an PV guest with PCI passthrough for example.
> Currently Xen only allows Dom0 IOMMU access to all (except Xen) MFN's.
Except in dom0-strict mode, which your proposal doesn't even
support. Yet mid/long term that mode should imo be what Dom0
should be run in by default.
> To not change the current security implications of the feature I would
> prefer that 'hardware domain' support was added as a separate design.
I think you mean "driver domain" here; "hardware domain" is the
generalized term for Dom0 (and with most of the respective changes
for the latter already in the staging tree I think you ought to no
longer use the term "Dom0" here).
Jan
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 15:48 ` Malcolm Crossley
@ 2014-04-14 16:00 ` Jan Beulich
2014-04-14 16:55 ` Malcolm Crossley
2014-05-01 11:56 ` Tim Deegan
1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-04-14 16:00 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
>>> On 14.04.14 at 17:48, <malcolm.crossley@citrix.com> wrote:
> On 14/04/14 14:47, Jan Beulich wrote:
>>>>> On 11.04.14 at 19:28, <malcolm.crossley@citrix.com> wrote:
>>> Purpose
>>> =======
>>>
>>> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
>>> hardware devices that Domain 0 has access to. This enables Domain 0 to
>>> program
>>> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
>>> mapping of GPFN to bus address space is created then a bounce buffer
>>> region is not required for the IO devices connected to the IOMMU.
>> So the idea then is for the domain to
>> - create this 1:1 mapping once at boot, and update it as pages get
>> ballooned int/out, or
>> - zap the entire IOMMU mapping at boot, and establish individual
>> mappings as needed based on the driver's use of the DMA map ops?
>>
>
> Yes, both of those are the general idea; I will document this better in
> draft B as per David's suggestion.
> The first strategy has better performance (fewer IOMMU updates) but lower
> security against bad devices; the second strategy has better security but
> higher IOMMU mapping overhead.
>
> The current idea for maximum performance is to create a 1:1 mapping for
> dom0 GPFN's and an offset (offset by max dom0 GPFN) 1:1 mapping for the
> whole of the MFN address space which will be used for grant mapped
> pages. This should result in IOMMU updates only for ballooned pages.
That's not fully scalable: You can't cope with systems making use of
the full physical address space (e.g. by sparsely allocating where
each node's memory goes). HP has had such systems for a
couple of years.
>>> IOMMUOP_unmap_page
>>> ----------------
>>> First argument, pointer to array of `struct iommu_map_op`
>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>>
>>> This subop will attempt to unmap each element in the `struct
>>> iommu_map_op` array
>>> and record the mapping status back into the array itself. If an
>> unmapping?
>
> Agreed
>>
>>> unmapping fault
>> mapping?
>
> This is an unmap hypercall so I believe it would be an unmapping fault
> reported.
No - an individual unmap operation failing should result in the
status being set to -EFAULT (or whatever). The only time you'd
fail the entire hypercall with -EFAULT would be when you can't
read an interface structure array element. Hence if anything this
would be a mapping fault (albeit I didn't like the term in the
IOMMU_map_page description either).
>>> The IOMMU TLB will only be flushed when the hypercall completes or a
>>> hypercall
>>> continuation is created.
>>>
>>> struct iommu_map_op {
>>> uint64_t bfn;
>>> uint64_t mfn;
>>> uint32_t flags;
>>> int32_t status;
>>> };
>>>
>>> --------------------------------------------------------------------
>>> Field Purpose
>>> ----- -----------------------------------------------------
>>> `bfn` [in] Bus address frame number to be unmapped
>>>
>>> `mfn` [in] This field is ignored for unmap subop
>>>
>>> `flags` [in] This field is ignored for unmap subop
>>>
>>> `status` [out] Mapping status of this unmap operation, 0 indicates
> success
>> Hmm, bfn and flags ignored would call for an abbreviated interface
>> structure. Or how about IOMMU_MAP_OP_{readable,writeable}
>> implying unmap, in which case for now you'd need only one op. And
>> with that I wonder whether this shouldn't simply be a new
>> PHYSDEVOP_iommu_map.
>
> I initially tried to add it to the PHYSDEV hypercall as a subop but when
> adding batch capability it seemed cleaner to use a 3 argument hypercall
> to allow for the count to be passed in as argument rather than as a
> field in a struct.
Yes, I see - batching really becomes easier that way.
>>> Security Implications of allowing Domain 0 IOMMU control
>>> ========================================================
>>>
>>> Xen currently allows IO devices attached to Domain 0 to have direct
>>> access to
>>> the all of the MFN address space (except Xen hypervisor memory regions),
>>> provided the Xen IOMMU option dom0-strict is not enabled.
>>>
>>> The PV IOMMU feature provides the same level of access to MFN address space
>>> and the feature is not enabled when the Xen IOMMU option dom0-strict is
>>> enabled. Therefore security is not affected by the PV IOMMU feature.
>> While I was about to reply with the same request to extend this to
>> driver domains, this last section pretty much excludes such an
>> extension. I guess that would benefit from revisiting.
>
> I agree supporting driver domains would not be straightforward with
> regards to security so I would suggest that driver domain support could
> be added later as a separate design.
That's odd - you design something anew and want to immediately
postpone one of the obvious potential uses of it? As just said in
another reply, I think you will want to support dom0-strict mode
anyway, and with that I would expect PV driver domain support to
fall out almost naturally.
Jan
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 15:48 ` Jan Beulich
@ 2014-04-14 16:38 ` Malcolm Crossley
2014-04-15 6:50 ` Jan Beulich
0 siblings, 1 reply; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-14 16:38 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
On 14/04/14 16:48, Jan Beulich wrote:
>>>> On 14.04.14 at 17:03, <malcolm.crossley@citrix.com> wrote:
>> On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
>>>> On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
>>>>> On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Here is a design for allowing Dom0 PV guests to control the IOMMU.
>>> With the device driver domains I think you should also rename the
>>> 'dom0' to device driver or 'hardware domain' - as this functionality
>>> should be possible within an PV guest with PCI passthrough for example.
>> Currently Xen only allows Dom0 IOMMU access to all (except Xen) MFN's.
>
> Except in dom0-strict mode, which your proposal doesn't even
> support. Yet mid/long term that mode should imo be what Dom0
> should be run in by default.
>
dom0-strict mode could be supported by validating that the MFNs passed
in via the map operation belong to the DomU making the hypercall.
Also, if we modify the grant map hypercalls to allow dev_bus_addr of
struct gnttab_map_grant_ref to be used as an input and output parameter
then we can specify the required BFN mapping for grant map pages. If we
also delay IOTLB flushes as part of batched grant map operations then
performance should also be OK.
I believe the two changes above should be enough to support
'driver domains'. Would you agree?
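As a sketch of what that grant map change could look like from a driver
domain, assuming dev_bus_addr becomes an in/out parameter and a new flag
(called GNTMAP_request_bus_addr here purely for illustration) asks Xen to
install the IOMMU mapping; only the existing gnttab_map_grant_ref fields
are real today:

    /* Sketch only: grant-map a page and ask Xen to also install an IOMMU
     * mapping at a caller-chosen BFN, confirmed back in dev_bus_addr. */
    static int map_grant_with_bfn(grant_ref_t gref, domid_t otherend_id,
                                  uint64_t map_vaddr, uint64_t wanted_bfn,
                                  uint64_t *bus_addr)
    {
        struct gnttab_map_grant_ref op = {
            .host_addr    = map_vaddr,
            .flags        = GNTMAP_host_map | GNTMAP_request_bus_addr,
            .ref          = gref,
            .dom          = otherend_id,
            .dev_bus_addr = wanted_bfn << PAGE_SHIFT, /* [in/out] requested BFN */
        };
        int rc = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1);

        if (rc || op.status != GNTST_okay)
            return rc ? rc : -EINVAL;

        *bus_addr = op.dev_bus_addr;  /* the BFN mapping Xen actually set up */
        return 0;
    }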
>> To not change the current security implications of the feature I would
>> prefer that 'hardware domain' support was added as a separate design.
>
> I think you mean "driver domain" here; "hardware domain" is the
> generalized term for Dom0 (and with most of the respective changes
> for the latter already in the staging tree I think you ought to no
> longer use the term "Dom0" here).
You are right, I did mean 'driver domain' for the security implications.
I will replace references to domain 0 with hardware domain in the next
draft.
>
> Jan
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 16:00 ` Jan Beulich
@ 2014-04-14 16:55 ` Malcolm Crossley
2014-04-15 6:56 ` Jan Beulich
0 siblings, 1 reply; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-14 16:55 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
On 14/04/14 17:00, Jan Beulich wrote:
>>>> On 14.04.14 at 17:48, <malcolm.crossley@citrix.com> wrote:
>> On 14/04/14 14:47, Jan Beulich wrote:
>>>>>> On 11.04.14 at 19:28, <malcolm.crossley@citrix.com> wrote:
>>>> Purpose
>>>> =======
>>>>
>>>> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
>>>> hardware devices that Domain 0 has access to. This enables Domain 0 to
>>>> program
>>>> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
>>>> mapping of GPFN to bus address space is created then a bounce buffer
>>>> region is not required for the IO devices connected to the IOMMU.
>>> So the idea then is for the domain to
>>> - create this 1:1 mapping once at boot, and update it as pages get
>>> ballooned int/out, or
>>> - zap the entire IOMMU mapping at boot, and establish individual
>>> mappings as needed based on the driver's use of the DMA map ops?
>>>
>>
>> Yes, both of those are the general idea; I will document this better in
>> draft B as per David's suggestion.
>> The first strategy has better performance (fewer IOMMU updates) but lower
>> security against bad devices; the second strategy has better security but
>> higher IOMMU mapping overhead.
>>
>> The current idea for maximum performance is to create a 1:1 mapping for
>> dom0 GPFN's and an offset (offset by max dom0 GPFN) 1:1 mapping for the
>> whole of the MFN address space which will be used for grant mapped
>> pages. This should result in IOMMU updates only for ballooned pages.
>
> That's not fully scalable: You can't cope with systems making use of
> the full physical address space (e.g. by sparsely allocating where
> each node's memory goes). HP has been having such systems for a
> couple of years.
>
Good point, I was not aware that there were systems which use the top of
the physical address space.
I propose that Domain 0 can detect the sparse holes via the E820 address
map and implement a simple form of address space compression. As long as
the amount of physical memory in the system is not more than half of the
physical memory address space then both a GPFN and "grant map 1:MFN" BFN
mapping should be able to fit.
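As a toy illustration of such compression (it uses only RAM ranges dom0
already knows about and ignores MMIO and other special ranges; the struct
and how the ranges are collected are assumptions, not part of the design):

    /* Sketch only: give every populated RAM range a compacted slot above
     * 'base_bfn', so grant-mapped MFNs get BFNs even on sparse systems. */
    struct ram_range { uint64_t start_mfn, nr_frames; };

    static uint64_t mfn_to_compact_bfn(const struct ram_range *r, int n,
                                       uint64_t base_bfn, uint64_t mfn)
    {
        uint64_t offset = 0;
        int i;

        for (i = 0; i < n; i++) {
            if (mfn >= r[i].start_mfn &&
                mfn < r[i].start_mfn + r[i].nr_frames)
                return base_bfn + offset + (mfn - r[i].start_mfn);
            offset += r[i].nr_frames;
        }
        return ~0ULL;   /* MFN not in a known RAM range */
    }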
>>>> IOMMUOP_unmap_page
>>>> ----------------
>>>> First argument, pointer to array of `struct iommu_map_op`
>>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>>>
>>>> This subop will attempt to unmap each element in the `struct
>>>> iommu_map_op` array
>>>> and record the mapping status back into the array itself. If an
>>> unmapping?
>>
>> Agreed
>>>
>>>> unmapping fault
>>> mapping?
>>
>> This is an unmap hypercall so I believe it would be an unmapping fault
>> reported.
>
> No - an individual unmap operation failing should result in the
>> status being set to -EFAULT (or whatever). The only time you'd
> fail the entire hypercall with -EFAULT would be when you can't
> read an interface structure array element. Hence if anything this
> would be a mapping fault (albeit I didn't like the term in the
> IOMMU_map_page description either).
>
OK, I agree. Would it be a cleaner hypercall interface if the hypercall
returns 0 if all ops succeeded and otherwise a positive count of the
failed ops? The caller would then need to scan the status entry of the
array to find the failed ops. Failed ops should hopefully be rare, and
so the overhead of scanning the array should be acceptable.
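For illustration, the caller side of that convention might look like the
following, again assuming a HYPERVISOR_iommu_op() wrapper name:

    /* Sketch only: rc == 0 means every op succeeded, rc > 0 is the number
     * of failed ops (details in ops[i].status), and rc < 0 is a
     * hypercall-level failure such as -EFAULT reading the array itself. */
    static int do_iommu_map(struct iommu_map_op *ops, unsigned int count)
    {
        unsigned int i;
        int rc = HYPERVISOR_iommu_op(IOMMUOP_map_page, ops, count);

        if (rc <= 0)
            return rc;

        for (i = 0; i < count; i++)
            if (ops[i].status)
                pr_warn("IOMMU map of bfn %llx failed: %d\n",
                        (unsigned long long)ops[i].bfn, ops[i].status);

        return -EIO;
    }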
>>>> The IOMMU TLB will only be flushed when the hypercall completes or a
>>>> hypercall
>>>> continuation is created.
>>>>
>>>> struct iommu_map_op {
>>>> uint64_t bfn;
>>>> uint64_t mfn;
>>>> uint32_t flags;
>>>> int32_t status;
>>>> };
>>>>
>>>> --------------------------------------------------------------------
>>>> Field Purpose
>>>> ----- -----------------------------------------------------
>>>> `bfn` [in] Bus address frame number to be unmapped
>>>>
>>>> `mfn` [in] This field is ignored for unmap subop
>>>>
>>>> `flags` [in] This field is ignored for unmap subop
>>>>
>>>> `status` [out] Mapping status of this unmap operation, 0 indicates
>> success
>>> Hmm, bfn and flags ignored would call for an abbreviated interface
>>> structure. Or how about IOMMU_MAP_OP_{readable,writeable}
>>> implying unmap, in which case for now you'd need only one op. And
>>> with that I wonder whether this shouldn't simply be a new
>>> PHYSDEVOP_iommu_map.
>>
>> I initially tried to add it to the PHYSDEV hypercall as a subop but when
>> adding batch capability it seemed cleaner to use a 3 argument hypercall
>> to allow for the count to be passed in as argument rather than as a
>> field in a struct.
>
> Yes, I see - batching really becomes easier that way.
>
>>>> Security Implications of allowing Domain 0 IOMMU control
>>>> ========================================================
>>>>
>>>> Xen currently allows IO devices attached to Domain 0 to have direct
>>>> access to
>>>> the all of the MFN address space (except Xen hypervisor memory regions),
>>>> provided the Xen IOMMU option dom0-strict is not enabled.
>>>>
>>>> The PV IOMMU feature provides the same level of access to MFN address space
>>>> and the feature is not enabled when the Xen IOMMU option dom0-strict is
>>>> enabled. Therefore security is not affected by the PV IOMMU feature.
>>> While I was about to reply with the same request to extend this to
>>> driver domains, this last section pretty much excludes such an
>>> extension. I guess that would benefit from revisiting.
>>
>> I agree supporting driver domains would not be straightforward with
>> regards to security so I would suggest that driver domain support could
>> be added later as a separate design.
>
> That's odd - you design something anew and want to immediately
> postpone one of the obvious potential uses of it? As just said in
> another reply, I think you will want to support dom0-strict mode
> anyway, and with that I would expect PV driver domain support to
> fall out almost naturally.
>
I was trying to restrict the scope so that the feature could be
implemented in smaller chunks. Adding PV driver domain support does not
look like a significant extra amount of work, so I've proposed a way to
support PV driver domains in another reply; I would be very interested
in thoughts on the technique I've proposed.
> Jan
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 16:38 ` Malcolm Crossley
@ 2014-04-15 6:50 ` Jan Beulich
0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2014-04-15 6:50 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
>>> On 14.04.14 at 18:38, <malcolm.crossley@citrix.com> wrote:
> On 14/04/14 16:48, Jan Beulich wrote:
>>>>> On 14.04.14 at 17:03, <malcolm.crossley@citrix.com> wrote:
>>> On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
>>>> On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
>>>>> On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
>>>>>> On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here is a design for allowing Dom0 PV guests to control the IOMMU.
>>>> With the device driver domains I think you should also rename the
>>>> 'dom0' to device driver or 'hardware domain' - as this functionality
>>>> should be possible within an PV guest with PCI passthrough for example.
>>> Currently Xen only allows Dom0 IOMMU access to all (except Xen) MFN's.
>>
>> Except in dom0-strict mode, which your proposal doesn't even
>> support. Yet mid/long term that mode should imo be what Dom0
>> should be run in by default.
>>
> dom0-strict mode could be supported by validating that the MFNs passed
> in via the map operation belong to the DomU making the hypercall.
>
> Also, if we modify the grant map hypercalls to allow dev_bus_addr of
> struct gnttab_map_grant_ref to be used as an input and output parameter
> then we can specify the required BFN mapping for grant map pages. If we
> also delay IOTLB flushes as part of batched grant map operations then
> performance should also be OK.
>
> I believe the two changes above should be enough to support
> 'driver domains'. Would you agree?
Yes, at first glance this should be all that's needed.
Jan
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 16:55 ` Malcolm Crossley
@ 2014-04-15 6:56 ` Jan Beulich
0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2014-04-15 6:56 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel
>>> On 14.04.14 at 18:55, <malcolm.crossley@citrix.com> wrote:
> On 14/04/14 17:00, Jan Beulich wrote:
>>>>> On 14.04.14 at 17:48, <malcolm.crossley@citrix.com> wrote:
>>> The current idea for maximum performance is to create a 1:1 mapping for
>>> dom0 GPFN's and an offset (offset by max dom0 GPFN) 1:1 mapping for the
>>> whole of the MFN address space which will be used for grant mapped
>>> pages. This should result in IOMMU updates only for ballooned pages.
>>
>> That's not fully scalable: You can't cope with systems making use of
>> the full physical address space (e.g. by sparsely allocating where
>> each node's memory goes). HP has had such systems for a
>> couple of years.
>>
> Good point, I was not aware that there were systems which use the top of
> the physical address space.
>
> I propose that Domain 0 can detect the sparse holes via the E820 address
> map and implement a simple form of address space compression. As long as
> the amount of physical memory in the system is not more than half of the
> physical memory address space then both a GPFN and "grant map 1:MFN" BFN
> mapping should be able to fit.
In the hopes that e.g. no system populates more than half of its
physical address space - not a very nice assumption. Plus when
determining holes here, I'm afraid the E820 map isn't sufficient: You'd
also need to consider MMIO regions, and other special firmware
(e.g. ACPI) and hardware (e.g. HyperTransport) ranges.
>>>>> IOMMUOP_unmap_page
>>>>> ----------------
>>>>> First argument, pointer to array of `struct iommu_map_op`
>>>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>>>>
>>>>> This subop will attempt to unmap each element in the `struct
>>>>> iommu_map_op` array
>>>>> and record the mapping status back into the array itself. If an
>>>> unmapping?
>>>
>>> Agreed
>>>>
>>>>> unmapping fault
>>>> mapping?
>>>
>>> This is an unmap hypercall so I believe it would be an unmapping fault
>>> reported.
>>
>> No - an individual unmap operation failing should result in the
>> status to be set to -EFAULT (or whatever). The only time you'd
>> fail the entire hypercall with -EFAULT would be when you can't
>> read an interface structure array element. Hence if anything this
>> would be a mapping fault (albeit I didn't like the term in the
>> IOMMU_map_page description either).
>>
>
> OK, I agree. Would it be a cleaner hypercall interface if the hypercall
> returns 0 if all ops succeeded and otherwise a positive count of the
> failed ops? The caller would then need to scan the status entry of the
> array to find the failed ops. Failed ops should hopefully be rare, and
> so the overhead of scanning the array should be acceptable.
Yes, that's reasonable (as would be the alternative used by some
memory ops, returning the number of succeeded slots).
Jan
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-11 17:28 [RFC] Dom0 PV IOMMU control design (draft A) Malcolm Crossley
` (2 preceding siblings ...)
2014-04-14 13:47 ` Jan Beulich
@ 2014-04-16 14:13 ` Zhang, Xiantao
2014-04-16 15:35 ` Malcolm Crossley
3 siblings, 1 reply; 18+ messages in thread
From: Zhang, Xiantao @ 2014-04-16 14:13 UTC (permalink / raw)
To: Malcolm Crossley, xen-devel
What's the real issue you are trying to resolve? Typically, Dom0's DMA memory is allocated in guest-physical contiguous memory, and then exchanged to machine-physical contiguous memory if its size exceeds one page. I don't see why a bounce buffer is needed in most cases.
Last year, we found that some legacy hardware drivers may use a static DMA buffer instead of a dynamic buffer, and if the static buffer's size exceeds one page, this may cause issues due to the lack of a memory exchange operation. A patch was cooked and posted to fix such issues, but it looks like it has not been merged so far. http://lkml.iu.edu//hypermail/linux/kernel/1212.0/02226.html
In that case a bounce buffer is indeed needed, but it is not a typical case.
Xiantao
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-16 14:13 ` Zhang, Xiantao
@ 2014-04-16 15:35 ` Malcolm Crossley
0 siblings, 0 replies; 18+ messages in thread
From: Malcolm Crossley @ 2014-04-16 15:35 UTC (permalink / raw)
To: Zhang, Xiantao, xen-devel
On 16/04/14 15:13, Zhang, Xiantao wrote:
> What's the real issue you are trying to resolve? Typically, Dom0's DMA memory is allocated in guest-physical contiguous memory, and then exchanged to machine-physical contiguous memory if its size exceeds one page. I don't see why a bounce buffer is needed in most cases.
> Last year, we found that some legacy hardware drivers may use a static DMA buffer instead of a dynamic buffer, and if the static buffer's size exceeds one page, this may cause issues due to the lack of a memory exchange operation. A patch was cooked and posted to fix such issues, but it looks like it has not been merged so far. http://lkml.iu.edu//hypermail/linux/kernel/1212.0/02226.html
> In that case a bounce buffer is indeed needed, but it is not a typical case.
We have found that many NIC drivers (e.g. Cisco enic, Solarflare sfc,
Mellanox mlx4) use buffers larger than a page when configured with jumbo
frames enabled. High performance NICs are typically configured with
jumbo frames, which means a lot of pressure is put on the SWIOTLB region
because the memory copies cannot occur as fast as the NIC hardware itself.
Using memory exchange based on the DMA mask to create bounce regions is
prone to allocation failures if the DMA mask is small (less than 4GB)
because potentially many PV driver domains are trying to use the same
low memory region.
Malcolm
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC] Dom0 PV IOMMU control design (draft A)
2014-04-14 15:48 ` Malcolm Crossley
2014-04-14 16:00 ` Jan Beulich
@ 2014-05-01 11:56 ` Tim Deegan
1 sibling, 0 replies; 18+ messages in thread
From: Tim Deegan @ 2014-05-01 11:56 UTC (permalink / raw)
To: Malcolm Crossley; +Cc: xen-devel, Jan Beulich
At 16:48 +0100 on 14 Apr (1397490487), Malcolm Crossley wrote:
> On 14/04/14 14:47, Jan Beulich wrote:
> >>>> On 11.04.14 at 19:28, <malcolm.crossley@citrix.com> wrote:
> >> Purpose
> >> =======
> >>
> >> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> >> hardware devices that Domain 0 has access to. This enables Domain 0 to
> >> program
> >> a bus address space mapping which matches it's GPFN mapping. Once a 1-1
> >> mapping of GPFN to bus address space is created then a bounce buffer
> >> region is not required for the IO devices connected to the IOMMU.
> > So the idea then is for the domain to
> > - create this 1:1 mapping once at boot, and update it as pages get
> > ballooned int/out, or
> > - zap the entire IOMMU mapping at boot, and establish individual
> > mappings as needed based on the driver's use of the DMA map ops?
> >
>
> Yes, both of those are the general idea; I will document this better in
> draft B as per David's suggestion.
> The first strategy has better performance (fewer IOMMU updates) but lower
> security against bad devices; the second strategy has better security but
> higher IOMMU mapping overhead.
>
> The current idea for maximum performance is to create a 1:1 mapping for
> dom0 GPFNs and an offset (offset by max dom0 GPFN) 1:1 mapping for the
> whole of the MFN address space, which will be used for grant-mapped
> pages. This should result in IOMMU updates only for ballooned pages.
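A rough sketch of that layout is below; queue_map()/flush_map_queue(), pfn_to_mfn() and the flag names are hypothetical helpers standing in for the kernel's own batching of IOMMUOP_map_page hypercalls.

    /* Sketch only: BFN == GPFN for dom0's own memory, plus an offset 1:1
     * window (BFN = max_gpfn + MFN) that grant-mapped foreign pages fall
     * into, so only ballooning requires further IOMMU updates. */
    static void build_dom0_bfn_layout(uint64_t max_gpfn, uint64_t max_mfn)
    {
        uint64_t gpfn, mfn;

        /* 1:1 mapping of dom0's GPFN space. */
        for ( gpfn = 0; gpfn < max_gpfn; gpfn++ )
            queue_map(/* bfn */ gpfn, pfn_to_mfn(gpfn),
                      IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable);

        /* Offset 1:1 window over the whole MFN space for grant-mapped pages. */
        for ( mfn = 0; mfn < max_mfn; mfn++ )
            queue_map(/* bfn */ max_gpfn + mfn, mfn,
                      IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable);

        flush_map_queue();   /* batches entries into IOMMUOP_map_page calls */
    }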
> >> IOMMUOP_unmap_page
> >> ----------------
> >> First argument, pointer to array of `struct iommu_map_op`
> >> Second argument, integer count of `struct iommu_map_op` elements in array
> >>
> >> This subop will attempt to unmap each element in the `struct
> >> iommu_map_op` array
> >> and record the mapping status back into the array itself. If an
> > unmapping?
>
> Agreed
> >
> >> unmapping fault
> > mapping?
>
> This is an unmap hypercall so I believe it would be an unmapping fault
> reported.
> >
> >> occurs then the hypercall stops processing the array and returns with an
> >> EFAULT;
> > -EFAULT.
>
> Agreed
> >> The IOMMU TLB will only be flushed when the hypercall completes or a
> >> hypercall
> >> continuation is created.
> >>
> >> struct iommu_map_op {
> >> uint64_t bfn;
> >> uint64_t mfn;
> >> uint32_t flags;
> >> int32_t status;
> >> };
> >>
> >> --------------------------------------------------------------------
> >> Field Purpose
> >> ----- -----------------------------------------------------
> >> `bfn` [in] Bus address frame number to be unmapped
> >>
> >> `mfn` [in] This field is ignored for unmap subop
> >>
> >> `flags` [in] This field is ignored for unmap subop
> >>
> >> `status` [out] Mapping status of this unmap operation, 0 indicates success
> > Hmm, mfn and flags ignored would call for an abbreviated interface
> > structure. Or how about IOMMU_MAP_OP_{readable,writeable}
> > implying unmap, in which case for now you'd need only one op. And
> > with that I wonder whether this shouldn't simply be a new
> > PHYSDEVOP_iommu_map.
>
> I initially tried to add it to the PHYSDEV hypercall as a subop but when
> adding batch capability it seemed cleaner to use a 3-argument hypercall
> to allow for the count to be passed in as an argument rather than as a
> field in a struct.
>
> We could have a single op and define a new flag to indicate we want an
> unmap to be performed.
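For concreteness, the two interface shapes under discussion might look roughly as follows; every name and value here is illustrative rather than taken from the draft:

    /* (a) Separate map/unmap subops, with the element count passed as the
     * third hypercall argument rather than inside a struct. */
    long HYPERVISOR_iommu_op(unsigned int subop,     /* IOMMUOP_map_page or
                                                        IOMMUOP_unmap_page */
                             struct iommu_map_op *ops,
                             unsigned int count);

    /* (b) A single op, with unmap selected per element via a flag. */
    #define IOMMU_MAP_OP_readable  (1u << 0)
    #define IOMMU_MAP_OP_writeable (1u << 1)
    #define IOMMU_MAP_OP_unmap     (1u << 2)   /* hypothetical new flag */

    long HYPERVISOR_iommu_map(struct iommu_map_op *ops, unsigned int count);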
> >
> >> Security Implications of allowing Domain 0 IOMMU control
> >> ========================================================
> >>
> >> Xen currently allows IO devices attached to Domain 0 to have direct
> >> access to all of the MFN address space (except Xen hypervisor memory
> >> regions), provided the Xen IOMMU option dom0-strict is not enabled.
> >>
> >> The PV IOMMU feature provides the same level of access to the MFN address
> >> space, and the feature is not enabled when the Xen IOMMU option dom0-strict
> >> is enabled. Therefore security is not affected by the PV IOMMU feature.
> > While I was about to reply with the same request to extend this to
> > driver domains, this last section pretty much excludes such an
> > extension. I guess that would benefit from revisiting.
>
> I agree supporting driver domains would not be straightforward with
> regard to security, so I would suggest that driver domain support could
> be added later as a separate design.
I am uncomfortable with adding any new barriers to adoption for driver
domains (in this case, a performance hit). Couldn't this be easily
extended to all domains by DTRT with refcounts? We'd supply a default
1-1 IOMMU table, as we (effectively) do for dom0, and allow the guest
to specify its own if it prefers.
Tim.
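A minimal sketch of the refcounting approach suggested above, using Xen-style helper names (mfn_to_page, get_page/put_page, iommu_map_page) as assumptions rather than actual code:

    /* Sketch only: take a reference on the target page before installing a
     * guest-requested IOMMU mapping, so the page cannot be freed or
     * reassigned while a device can still DMA to it. */
    static int guest_iommu_map_one(struct domain *d, uint64_t bfn,
                                   uint64_t mfn, unsigned int flags)
    {
        struct page_info *pg = mfn_to_page(mfn);

        if ( !get_page(pg, d) )      /* fails if d cannot take a ref on the page */
            return -EPERM;

        if ( iommu_map_page(d, bfn, mfn, flags) )
        {
            put_page(pg);
            return -EIO;
        }

        /* The reference is dropped again when the BFN is unmapped. */
        return 0;
    }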
Thread overview: 18+ messages
2014-04-11 17:28 [RFC] Dom0 PV IOMMU control design (draft A) Malcolm Crossley
2014-04-11 17:50 ` Konrad Rzeszutek Wilk
2014-04-14 12:12 ` Malcolm Crossley
2014-04-14 12:51 ` Konrad Rzeszutek Wilk
2014-04-14 15:03 ` Malcolm Crossley
2014-04-14 15:09 ` Konrad Rzeszutek Wilk
2014-04-14 15:48 ` Jan Beulich
2014-04-14 16:38 ` Malcolm Crossley
2014-04-15 6:50 ` Jan Beulich
2014-04-14 11:52 ` David Vrabel
2014-04-14 13:47 ` Jan Beulich
2014-04-14 15:48 ` Malcolm Crossley
2014-04-14 16:00 ` Jan Beulich
2014-04-14 16:55 ` Malcolm Crossley
2014-04-15 6:56 ` Jan Beulich
2014-05-01 11:56 ` Tim Deegan
2014-04-16 14:13 ` Zhang, Xiantao
2014-04-16 15:35 ` Malcolm Crossley