Re: [RFC] Dom0 PV IOMMU control design (draft A)

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [RFC] Dom0 PV IOMMU control design (draft A)
Date: Fri, 11 Apr 2014 13:50:36 -0400
Message-ID: <20140411175036.GA15429@phenom.dumpdata.com>
In-Reply-To: <5348264B.1040800@citrix.com>

On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> Hi,
> 
> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> This allows for the Dom0 GPFN mapping to be programmed into the
> IOMMU and avoid using the SWIOTLB bounce buffer technique in the
> Linux kernel (except for legacy 32 bit DMA IO devices)
> 
> This feature provides two gains:
> 1.  Improved performance for use cases which relied upon the bounce
> buffer e.g. NIC cards using jumbo frames with linear buffers.
> 2.  Prevent SWIOTLB bounce buffer region exhaustion which can cause
> unrecoverable Linux kernel driver errors.
> 
> A PDF version of the document is available here:
> 
> http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
> 
> The pandoc markdown format of the document is provided below to
> allow for easier inline comments:
> 
> Introduction
> ============
> 
> Background
> -------
> 
> Xen PV guests use a Guest Pseudo-physical Frame Number (GPFN) address space
> which is decoupled from the host Machine Frame Number (MFN) address space.
> PV guests which interact with hardware need to translate GPFN addresses to
> MFN addresses because hardware uses the host address space only.
> PV guest hardware drivers are only aware of the GPFN address space and
> assume that if GPFN addresses are contiguous then the hardware addresses
> would be contiguous as well. The decoupling between GPFN and MFN address
> spaces means GPFN and MFN addresses may not be contiguous across page
> boundaries and thus a buffer allocated in GPFN address space which spans a
> page boundary may not be contiguous in MFN address space.
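
To make the contiguity problem concrete, here is a minimal sketch of the
check a PV guest would effectively have to do before handing a multi-page
buffer to hardware. The `p2m` array and the helper name are assumptions
made up for illustration; they are not part of the draft or of any Xen
interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: p2m[] is a toy GPFN -> MFN lookup table.
 * Returns true only if the nr_pages frames starting at first_gpfn
 * are backed by consecutive MFNs, i.e. the buffer is contiguous in
 * machine address space as well as in guest address space.
 */
static bool range_is_machine_contiguous(const uint64_t *p2m,
                                        uint64_t first_gpfn,
                                        size_t nr_pages)
{
    for (size_t i = 1; i < nr_pages; i++) {
        if (p2m[first_gpfn + i] != p2m[first_gpfn] + i)
            return false;
    }
    return true;
}
```

With a toy table of { 100, 101, 205, 206 }, a two-page buffer at GPFN 0 is
machine contiguous, while one at GPFN 1 is not and would need bouncing.
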
> 
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
> 
> A bounce buffer region is a special part of the GPFN address space which has
> been made to be contiguous in both GPFN and MFN address spaces. When a
> driver requests that a buffer which spans a page boundary be made available
> for hardware to read, core operating system code copies the buffer into a
> temporarily reserved part of the bounce buffer region and then returns the
> MFN address of the reserved part of the bounce buffer region back to the
> driver itself. The driver then instructs the hardware to read the copy of
> the buffer in the bounce buffer. Similarly, if the driver requests that a
> buffer be made available for hardware to write to, a region of the bounce
> buffer is first reserved, and after the hardware completes writing, the
> reserved region of the bounce buffer is copied to the originally allocated
> buffer.
> 
> The overhead of memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk the fixed size
> bounce buffer region will become exhausted and it will not be possible to
> return a hardware address back to the driver. The Linux kernel drivers do
> not tolerate this failure and so the kernel is forced to crash, as an
> uncorrectable error has occurred.
> 
> Input/Output Memory Management Units (IOMMU) allow for an inbound address
> mapping to be created from the I/O Bus address space (typically PCI) to
> the machine frame number address space. IOMMUs typically use a page table
> mechanism to manage the mappings and therefore can create mappings of page
> size granularity or larger.
> 
> Purpose
> =======
> 
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0 to
> program a bus address space mapping which matches its GPFN mapping. Once a
> 1-1 mapping of GPFN to bus address space is created then a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
> 
> 
> Architecture
> ============
> 
> A three argument hypercall interface (do_iommu_op), implementing two
> hypercall subops.
> 
> Design considerations for hypercall subops
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the IOMMU
> TLB to ensure the IO device uses the updated mappings.
> 
> The subops have been designed to take an array of operations and a count as
> parameters. This allows for easily implemented hypercall continuations to
> be used and allows for batches of IOMMU operations to be submitted before
> flushing the IOMMU TLB.
> 
> 
> 
> IOMMUOP_map_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array

Could this be 'unsigned integer' count?

Is there a limit? Can I do 31415 of them? Can I do it for the whole
memory space of the guest?

> 
> This subop will attempt to IOMMU map each element in the `struct
> iommu_map_op` array and record the mapping status back into the array
> itself. If a mapping fault occurs then the hypercall will return with
> -EFAULT.

> 
> This subop will inspect the MFN address being mapped in each iommu_map_op
> to ensure it does not belong to the Xen hypervisor itself. If the MFN does
> belong to the Xen hypervisor the subop will return -EPERM in the status
> field for that particular iommu_map_op.

Is it OK if the MFN belongs to another guest? 

> 
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
> 
>     struct iommu_map_op {
>         uint64_t bfn;

bus_frame ?

>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
> 
> ------------------------------------------------------------------------------
> Field          Purpose
> ----- ---------------------------------------------------------------
> `bfn`          [in] Bus address frame number to be mapped to the
>                specified mfn below

Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
How would dom0 know the bus address? That usually is something only the
IOMMU knows.

> 
> `mfn`          [in] Machine address frame number
> 


We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
that in the PCIe devices right?


> `flags`        [in] Flags for signalling type of IOMMU mapping to be created
> 
> `status`       [out] Mapping status of this map operation, 0 indicates
>                success
> ------------------------------------------------------------------------------
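
To check my reading of the batch semantics described so far, here is a
minimal mock in plain C, not hypervisor code. It records a per-element
status, puts -EPERM in the status field for Xen-owned MFNs, and flushes
once per batch. The ownership test, the flush counter and the function
names are all hypothetical. The unsigned count follows my suggestion
above, and returning 0 on success is an assumption, since the draft does
not say what success returns.

```c
#include <errno.h>
#include <stdint.h>

struct iommu_map_op {
    uint64_t bfn;
    uint64_t mfn;
    uint32_t flags;
    int32_t status;
};

/* Hypothetical counter standing in for real IOMMU TLB flushes. */
static unsigned int tlb_flushes;

/* Hypothetical ownership test: pretend MFNs 0..15 belong to Xen. */
static int mock_owned_by_xen(uint64_t mfn)
{
    return mfn < 16;
}

/*
 * Mock of the map subop: per-element status is written back into the
 * array, and the TLB is flushed once for the whole batch.
 */
static int mock_iommu_map_batch(struct iommu_map_op *ops, unsigned int count)
{
    for (unsigned int i = 0; i < count; i++) {
        if (mock_owned_by_xen(ops[i].mfn)) {
            ops[i].status = -EPERM;
            continue;
        }
        /* Real code would install the bfn -> mfn mapping here. */
        ops[i].status = 0;
    }
    tlb_flushes++; /* single flush when the batch completes */
    return 0;      /* assumed success value, not stated in the draft */
}
```

Note that in this reading a rejected element does not stop the batch; the
caller has to walk the status fields afterwards to find failures.
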
> 
> 
> Defined bits for flags field
> ------------------------------------------------------------------------
> Name                        Bit                Definition
> ----                       ----- ----------------------------------
> IOMMU_MAP_OP_readable        0        Create readable IOMMU mapping
> IOMMU_MAP_OP_writeable       1        Create writeable IOMMU mapping

And is it OK to use both?

> Reserved for future use     2-31                   n/a
> ------------------------------------------------------------------------
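
For illustration, a sketch of how Dom0 might fill in a batch of ops for
the 1-1 GPFN to bus address mapping described under Purpose. The bit
positions come from the table above; the macro definitions, the toy `p2m`
table and the helper name are assumptions. Setting both bits at once is
also an assumption, pending an answer to my question above.

```c
#include <stddef.h>
#include <stdint.h>

struct iommu_map_op {
    uint64_t bfn;
    uint64_t mfn;
    uint32_t flags;
    int32_t status;
};

/* Bit 0 and bit 1 per the flags table; macro form is an assumption. */
#define IOMMU_MAP_OP_readable  (1u << 0)
#define IOMMU_MAP_OP_writeable (1u << 1)

/*
 * Hypothetical helper: build ops so that each bus frame number equals
 * the guest frame number, with the MFN looked up in a toy p2m table.
 */
static void build_identity_ops(struct iommu_map_op *ops,
                               const uint64_t *p2m,
                               uint64_t first_gpfn,
                               size_t count)
{
    for (size_t i = 0; i < count; i++) {
        ops[i].bfn = first_gpfn + i;      /* bus frame == guest frame */
        ops[i].mfn = p2m[first_gpfn + i]; /* backing machine frame */
        ops[i].flags = IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable;
        ops[i].status = 0;                /* overwritten by the subop */
    }
}
```

With such a mapping in place a device can be given GPFN-based addresses
directly, which is what removes the need for the bounce buffer.
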
> 
> Additional error codes specific to this hypercall:
> 
> Error code  Reason
> ---------- ------------------------------------------------------------
> EPERM       PV IOMMU mode not enabled or calling domain is not domain 0

And -EFAULT

And what about success? Do you return 0 or the number of ops that were
successful?

> ------------------------------------------------------------------------
> 
> IOMMUOP_unmap_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array

Um, 'unsigned integer' count?

> 
> This subop will attempt to unmap each element in the `struct iommu_map_op`
> array and record the mapping status back into the array itself. If an
> unmapping fault occurs then the hypercall stops processing the array and
> returns with -EFAULT.
> 
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
> 
>     struct iommu_map_op {
>         uint64_t bfn;
>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `bfn`          [in] Bus address frame number to be unmapped

I presume this is gathered from the 'map' call?

> 
> `mfn`          [in] This field is ignored for unmap subop
> 
> `flags`        [in] This field is ignored for unmap subop
> 
> `status`       [out] Mapping status of this unmap operation, 0 indicates
>                success
> --------------------------------------------------------------------
> 
> Additional error codes specific to this hypercall:
> 
> Error code  Reason
> ---------- ------------------------------------------------------------
> EPERM       PV IOMMU mode not enabled or calling domain is not domain 0

EFAULT too

> ------------------------------------------------------------------------
> 
> 
> Conditions for which PV IOMMU hypercalls succeed
> ------------------------------------------------
> All the following conditions are required to be true for PV IOMMU hypercalls
> to succeed:
> 
> 1. IOMMU detected and supported by Xen
> 2. The following Xen IOMMU options are NOT enabled:
> dom0-passthrough, dom0-strict
> 3. Domain 0 is making the hypercall
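
The three conditions above can be sketched as a single precondition check.
The state structure and function name are hypothetical. Returning -EPERM
for a missing IOMMU is my assumption, folding it into the "PV IOMMU mode
not enabled" case from the error tables.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the relevant Xen IOMMU settings. */
struct pv_iommu_state {
    bool iommu_supported;  /* condition 1 */
    bool dom0_passthrough; /* condition 2: must be off */
    bool dom0_strict;      /* condition 2: must be off */
};

/* Returns 0 if the PV IOMMU hypercalls may proceed, -EPERM otherwise. */
static int pv_iommu_check(const struct pv_iommu_state *s,
                          unsigned int caller_domid)
{
    if (!s->iommu_supported || s->dom0_passthrough || s->dom0_strict)
        return -EPERM; /* PV IOMMU mode not enabled */
    if (caller_domid != 0)
        return -EPERM; /* condition 3: caller is not Domain 0 */
    return 0;
}
```
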
> 
> 
> Security Implications of allowing Domain 0 IOMMU control
> ========================================================
> 
> Xen currently allows IO devices attached to Domain 0 to have direct access
> to all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
> 
> The PV IOMMU feature provides the same level of access to MFN address space
> and the feature is not enabled when the Xen IOMMU option dom0-strict is
> enabled. Therefore security is not affected by the PV IOMMU feature.
> 
> 
> 
