From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [RFC] Dom0 PV IOMMU control design (draft A)
Date: Fri, 11 Apr 2014 13:50:36 -0400
Message-ID: <20140411175036.GA15429@phenom.dumpdata.com>
In-Reply-To: <5348264B.1040800@citrix.com>
On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> Hi,
>
> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> This allows the Dom0 GPFN mapping to be programmed into the
> IOMMU, avoiding the SWIOTLB bounce buffer technique in the
> Linux kernel (except for legacy 32-bit DMA IO devices).
>
> This feature provides two gains:
> 1. Improved performance for use cases which relied upon the bounce
> buffer, e.g. NICs using jumbo frames with linear buffers.
> 2. Prevention of SWIOTLB bounce buffer region exhaustion, which can cause
> unrecoverable Linux kernel driver errors.
>
> A PDF version of the document is available here:
>
> http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
>
> The pandoc markdown format of the document is provided below to
> allow for easier inline comments:
>
> Introduction
> ============
>
> Background
> -------
>
> Xen PV guests use a Guest Pseudo-physical Frame Number (GPFN) address space
> which is decoupled from the host Machine Frame Number (MFN) address space.
> PV guests which interact with hardware need to translate GPFN addresses to
> MFN addresses because hardware uses the host address space only.
> PV guest hardware drivers are aware only of the GPFN address space and
> assume that if GPFN addresses are contiguous then the hardware addresses
> are contiguous as well. The decoupling between the GPFN and MFN address
> spaces means GPFN and MFN addresses may not be contiguous across page
> boundaries, and thus a buffer allocated in GPFN address space which spans a
> page boundary may not be contiguous in MFN address space.
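>
> As an illustration, a PV guest could detect such a discontinuity with a
> check along these lines (a minimal sketch; `pfn_to_mfn` stands in for the
> guest's GPFN-to-MFN translation helper):
>
>     /* Sketch: a buffer that is GPFN-contiguous is only safe for DMA
>      * if the underlying MFNs are contiguous as well. */
>     static int gpfn_range_is_machine_contiguous(unsigned long gpfn,
>                                                 unsigned int nr_pages)
>     {
>         unsigned int i;
>
>         for ( i = 1; i < nr_pages; i++ )
>             if ( pfn_to_mfn(gpfn + i) != pfn_to_mfn(gpfn) + i )
>                 return 0; /* hardware would see a discontinuity here */
>         return 1;
>     }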
>
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
>
> A bounce buffer region is a special part of the GPFN address space which
> has been made contiguous in both the GPFN and MFN address spaces. When a
> driver requests that a buffer spanning a page boundary be made available
> for hardware to read, core operating system code copies the buffer into a
> temporarily reserved part of the bounce buffer region and returns the MFN
> address of that reserved part back to the driver. The driver then instructs
> the hardware to read the copy of the buffer in the bounce buffer. Similarly,
> if the driver requests that a buffer be made available for hardware to
> write to, a region of the bounce buffer is reserved first, and after the
> hardware completes its writes the reserved region of the bounce buffer is
> copied back to the originally allocated buffer.
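>
> As a minimal sketch of the read-direction path described above (all
> function names here are illustrative, not the actual SWIOTLB API):
>
>     /* Sketch: make a possibly MFN-discontiguous buffer visible to
>      * hardware for reading by copying it into the bounce region. */
>     void *bounce_map_for_device_read(void *buf, size_t len)
>     {
>         void *slot = bounce_reserve(len);   /* illustrative allocator */
>
>         if ( !slot )
>             return NULL;        /* bounce region exhausted */
>         memcpy(slot, buf, len); /* hardware reads this contiguous copy */
>         return slot;            /* machine-contiguous address for driver */
>     }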
>
> The overhead of the memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk the fixed-size
> bounce buffer region will become exhausted and it will not be possible to
> return a hardware address back to the driver. Linux kernel drivers do not
> tolerate this failure, and so the kernel is forced to crash, as an
> uncorrectable error has occurred.
>
> Input/Output Memory Management Units (IOMMUs) allow an inbound address
> mapping to be created from the I/O bus address space (typically PCI) to
> the machine frame number address space. IOMMUs typically use a page table
> mechanism to manage the mappings and can therefore create mappings of
> page size granularity or larger.
>
> Purpose
> =======
>
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0 to
> program a bus address space mapping which matches its GPFN mapping. Once a
> 1:1 mapping of GPFN to bus address space is created, a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
>
>
> Architecture
> ============
>
> A three-argument hypercall interface (do_iommu_op), implementing two
> hypercall subops.
>
> Design considerations for hypercall subops
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the IOMMU
> TLB to ensure the IO device uses the updated mappings.
>
> The subops have been designed to take an array of operations and a count
> as parameters. This allows hypercall continuations to be implemented
> easily, and allows batches of IOMMU operations to be submitted before
> flushing the IOMMU TLB.
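>
> A sketch of the intended hypervisor-side structure follows; the
> continuation mechanics are modelled on Xen's existing
> hypercall_preempt_check(), guest_handle_add_offset() and
> hypercall_create_continuation(), while the per-op handler, flush hook,
> and hypercall number are illustrative:
>
>     static long do_iommu_op(unsigned int subop,
>                             XEN_GUEST_HANDLE_PARAM(iommu_map_op_t) ops,
>                             unsigned int count)
>     {
>         unsigned int i;
>
>         for ( i = 0; i < count; i++ )
>         {
>             if ( hypercall_preempt_check() )
>             {
>                 iommu_flush_tlb();           /* flush before pausing */
>                 guest_handle_add_offset(ops, i);
>                 return hypercall_create_continuation(
>                     __HYPERVISOR_iommu_op, "ihi", subop, ops, count - i);
>             }
>             process_iommu_op(subop, ops, i); /* illustrative handler */
>         }
>         iommu_flush_tlb();                   /* one flush for the batch */
>         return 0;
>     }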
>
>
>
> IOMMUOP_map_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Could this be 'unsigned integer' count?
Is there a limit? Can I do 31415 of them? Can I do it for the whole
memory space of the guest?
>
> This subop will attempt to IOMMU map each element in the `struct
> iommu_map_op` array and record the mapping status back into the array
> itself. If a mapping fault occurs then the hypercall will return with
> -EFAULT.
>
> This subop will inspect the MFN address being mapped in each iommu_map_op
> to ensure it does not belong to the Xen hypervisor itself. If the MFN does
> belong to the Xen hypervisor, the subop will return -EPERM in the status
> field for that particular iommu_map_op.
Is it OK if the MFN belongs to another guest?
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
>
>     struct iommu_map_op {
>         uint64_t bfn;
bus_frame ?
>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
>
> ------------------------------------------------------------------------------
> Field Purpose
> ----- ---------------------------------------------------------------
> `bfn` [in] Bus address frame number to be mapped to the specified mfn
>       below
Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
How would dom0 know the bus address? That usually is something only the
IOMMU knows.
>
> `mfn` [in] Machine address frame number
>
We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
that in the PCIe devices right?
> `flags` [in] Flags for signalling type of IOMMU mapping to be created
>
> `status` [out] Mapping status of this map operation, 0 indicates success
> ------------------------------------------------------------------------------
>
>
> Defined bits for flags field
> ------------------------------------------------------------------------
> Name Bit Definition
> ---- ----- ----------------------------------
> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
And is it OK to use both?
> Reserved for future use 2-31 n/a
> ------------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
And -EFAULT
and what about success? Do you return 0 or the number of ops that were
successful?
> ------------------------------------------------------------------------
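>
> As a usage sketch from the Domain 0 side (the guest wrapper
> HYPERVISOR_iommu_op and the surrounding names are illustrative; the flag
> values follow the bit table above):
>
>     /* Sketch: establish a writable 1:1 mapping for one frame. */
>     struct iommu_map_op op = {
>         .bfn   = gpfn,                   /* bus frame == GPFN for 1:1 */
>         .mfn   = pfn_to_mfn(gpfn),       /* guest GPFN-to-MFN translation */
>         .flags = (1u << 0) | (1u << 1),  /* readable | writeable bits */
>     };
>
>     if ( HYPERVISOR_iommu_op(IOMMUOP_map_page, &op, 1) || op.status )
>         /* fall back to bounce buffering for this frame */;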
>
> IOMMUOP_unmap_page
> ------------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Um, 'unsigned integer' count?
>
> This subop will attempt to unmap each element in the `struct iommu_map_op`
> array and record the unmapping status back into the array itself. If an
> unmapping fault occurs then the hypercall stops processing the array and
> returns with -EFAULT.
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
>
>     struct iommu_map_op {
>         uint64_t bfn;
>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
>
> --------------------------------------------------------------------
> Field Purpose
> ----- -----------------------------------------------------
> `bfn` [in] Bus address frame number to be unmapped
I presume this is gathered from the 'map' call?
>
> `mfn` [in] This field is ignored for unmap subop
>
> `flags` [in] This field is ignored for unmap subop
>
> `status` [out] Mapping status of this unmap operation, 0
>          indicates success
> --------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
EFAULT too
> ------------------------------------------------------------------------
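>
> The corresponding teardown reuses the same structure (mfn and flags are
> ignored, per the table above; HYPERVISOR_iommu_op is the same illustrative
> wrapper as before):
>
>     /* Sketch: remove the bus mapping established earlier. */
>     struct iommu_map_op op = { .bfn = gpfn };
>
>     if ( HYPERVISOR_iommu_op(IOMMUOP_unmap_page, &op, 1) || op.status )
>         /* mapping may still be live; do not recycle the frame for DMA */;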
>
>
> Conditions for which PV IOMMU hypercalls succeed
> ------------------------------------------------
> All of the following conditions are required to be true for PV IOMMU
> hypercalls to succeed (a sketch of the combined check follows the list):
>
> 1. IOMMU detected and supported by Xen
> 2. The following Xen IOMMU options are NOT enabled:
> dom0-passthrough, dom0-strict
> 3. Domain 0 is making the hypercall
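>
> A sketch of the combined check (iommu_enabled, iommu_passthrough,
> iommu_dom0_strict and is_hardware_domain() exist in Xen today; combining
> them this way is an assumption of this sketch):
>
>     /* Sketch: gate every PV IOMMU subop on the three conditions above. */
>     if ( !iommu_enabled ||                          /* condition 1 */
>          iommu_passthrough || iommu_dom0_strict ||  /* condition 2 */
>          !is_hardware_domain(current->domain) )     /* condition 3 */
>         return -EPERM;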
>
>
> Security Implications of allowing Domain 0 IOMMU control
> ========================================================
>
> Xen currently allows IO devices attached to Domain 0 to have direct access
> to all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
>
> The PV IOMMU feature provides the same level of access to the MFN address
> space, and the feature is not enabled when the Xen IOMMU option dom0-strict
> is enabled. Therefore security is not affected by the PV IOMMU feature.
>
>
>