From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [RFC] Dom0 PV IOMMU control design (draft A)
Date: Fri, 11 Apr 2014 13:50:36 -0400
Message-ID: <20140411175036.GA15429@phenom.dumpdata.com>
In-Reply-To: <5348264B.1040800@citrix.com>
On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> Hi,
>
> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> This allows the Dom0 GPFN mapping to be programmed into the
> IOMMU, avoiding the SWIOTLB bounce buffer technique in the
> Linux kernel (except for legacy 32-bit DMA IO devices).
>
> This feature provides two gains:
> 1. Improved performance for use cases which relied upon the bounce
> buffer, e.g. NICs using jumbo frames with linear buffers.
> 2. Prevention of SWIOTLB bounce buffer region exhaustion, which can cause
> unrecoverable Linux kernel driver errors.
>
> A PDF version of the document is available here:
>
> http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
>
> The pandoc markdown format of the document is provided below to
> allow for easier inline comments:
>
> Introduction
> ============
>
> Background
> -------
>
> Xen PV guests use a Guest Pseudo-physical Frame Number (GPFN) address space
> which is decoupled from the host Machine Frame Number (MFN) address space.
> PV guests which interact with hardware need to translate GPFN addresses to
> MFN addresses because hardware uses the host address space only.
> PV guest hardware drivers are aware only of the GPFN address space and
> assume that if GPFN addresses are contiguous then the hardware addresses
> are contiguous as well. The decoupling between the GPFN and MFN address
> spaces means GPFN and MFN addresses may not be contiguous across page
> boundaries, and thus a buffer allocated in GPFN address space which spans a
> page boundary may not be contiguous in MFN address space.
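>
> As an illustration, a PV guest could detect such a discontinuity with a
> check along these lines (a minimal sketch; `pfn_to_mfn` stands in for the
> guest's GPFN-to-MFN translation helper):
>
>     /* Sketch: a buffer that is GPFN-contiguous is only safe for DMA
>      * if the underlying MFNs are contiguous as well. */
>     static int gpfn_range_is_machine_contiguous(unsigned long gpfn,
>                                                 unsigned int nr_pages)
>     {
>         unsigned int i;
>
>         for ( i = 1; i < nr_pages; i++ )
>             if ( pfn_to_mfn(gpfn + i) != pfn_to_mfn(gpfn) + i )
>                 return 0; /* hardware would see a discontinuity here */
>         return 1;
>     }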
>
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
>
> A bounce buffer region is a special part of the GPFN address space which
> has been made contiguous in both the GPFN and MFN address spaces. When a
> driver requests that a buffer spanning a page boundary be made available
> for hardware to read, core operating system code copies the buffer into a
> temporarily reserved part of the bounce buffer region and returns the MFN
> address of that reserved part back to the driver. The driver then instructs
> the hardware to read the copy of the buffer in the bounce buffer. Similarly,
> if the driver requests that a buffer be made available for hardware to
> write to, a region of the bounce buffer is reserved first, and after the
> hardware completes its writes the reserved region of the bounce buffer is
> copied back to the originally allocated buffer.
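>
> As a minimal sketch of the read-direction path described above (all
> function names here are illustrative, not the actual SWIOTLB API):
>
>     /* Sketch: make a possibly MFN-discontiguous buffer visible to
>      * hardware for reading by copying it into the bounce region. */
>     void *bounce_map_for_device_read(void *buf, size_t len)
>     {
>         void *slot = bounce_reserve(len);   /* illustrative allocator */
>
>         if ( !slot )
>             return NULL;        /* bounce region exhausted */
>         memcpy(slot, buf, len); /* hardware reads this contiguous copy */
>         return slot;            /* machine-contiguous address for driver */
>     }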
>
> The overhead of the memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk the fixed-size
> bounce buffer region will become exhausted and it will not be possible to
> return a hardware address back to the driver. Linux kernel drivers do not
> tolerate this failure, and so the kernel is forced to crash, as an
> uncorrectable error has occurred.
>
> Input/Output Memory Management Units (IOMMUs) allow an inbound address
> mapping to be created from the I/O bus address space (typically PCI) to
> the machine frame number address space. IOMMUs typically use a page table
> mechanism to manage the mappings and can therefore create mappings of
> page size granularity or larger.
>
> Purpose
> =======
>
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0 to
> program a bus address space mapping which matches its GPFN mapping. Once a
> 1:1 mapping of GPFN to bus address space is created, a bounce buffer
> region is not required for the IO devices connected to the IOMMU.
>
>
> Architecture
> ============
>
> A three-argument hypercall interface (do_iommu_op), implementing two
> hypercall subops.
>
> Design considerations for hypercall subops
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the IOMMU
> TLB to ensure the IO device uses the updated mappings.
>
> The subops have been designed to take an array of operations and a count
> as parameters. This allows hypercall continuations to be implemented
> easily, and allows batches of IOMMU operations to be submitted before
> flushing the IOMMU TLB.
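>
> A sketch of the intended hypervisor-side structure follows; the
> continuation mechanics are modelled on Xen's existing
> hypercall_preempt_check(), guest_handle_add_offset() and
> hypercall_create_continuation(), while the per-op handler, flush hook,
> and hypercall number are illustrative:
>
>     static long do_iommu_op(unsigned int subop,
>                             XEN_GUEST_HANDLE_PARAM(iommu_map_op_t) ops,
>                             unsigned int count)
>     {
>         unsigned int i;
>
>         for ( i = 0; i < count; i++ )
>         {
>             if ( hypercall_preempt_check() )
>             {
>                 iommu_flush_tlb();           /* flush before pausing */
>                 guest_handle_add_offset(ops, i);
>                 return hypercall_create_continuation(
>                     __HYPERVISOR_iommu_op, "ihi", subop, ops, count - i);
>             }
>             process_iommu_op(subop, ops, i); /* illustrative handler */
>         }
>         iommu_flush_tlb();                   /* one flush for the batch */
>         return 0;
>     }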
>
>
>
> IOMMUOP_map_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Could this be 'unsigned integer' count?
Is there a limit? Can I do 31415 of them? Can I do it for the whole
memory space of the guest?
>
> This subop will attempt to IOMMU map each element in the `struct
> iommu_map_op` array and record the mapping status back into the array
> itself. If a mapping fault occurs then the hypercall will return with
> -EFAULT.
>
> This subop will inspect the MFN address being mapped in each iommu_map_op
> to ensure it does not belong to the Xen hypervisor itself. If the MFN does
> belong to the Xen hypervisor, the subop will return -EPERM in the status
> field for that particular iommu_map_op.
Is it OK if the MFN belongs to another guest?
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
>
>     struct iommu_map_op {
>         uint64_t bfn;
bus_frame ?
>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
>
> ------------------------------------------------------------------------------
> Field Purpose
> ----- ---------------------------------------------------------------
> `bfn` [in] Bus address frame number to be mapped to the specified mfn
>       below
Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
How would dom0 know the bus address? That usually is something only the
IOMMU knows.
>
> `mfn` [in] Machine address frame number
>
We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
that in the PCIe devices right?
> `flags` [in] Flags for signalling type of IOMMU mapping to be created
>
> `status` [out] Mapping status of this map operation, 0 indicates success
> ------------------------------------------------------------------------------
>
>
> Defined bits for flags field
> ------------------------------------------------------------------------
> Name Bit Definition
> ---- ----- ----------------------------------
> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
And is it OK to use both?
> Reserved for future use 2-31 n/a
> ------------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
And -EFAULT
and what about success? Do you return 0 or the number of ops that were
successful?
> ------------------------------------------------------------------------
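>
> As a usage sketch from the Domain 0 side (the guest wrapper
> HYPERVISOR_iommu_op and the surrounding names are illustrative; the flag
> values follow the bit table above):
>
>     /* Sketch: establish a writable 1:1 mapping for one frame. */
>     struct iommu_map_op op = {
>         .bfn   = gpfn,                   /* bus frame == GPFN for 1:1 */
>         .mfn   = pfn_to_mfn(gpfn),       /* guest GPFN-to-MFN translation */
>         .flags = (1u << 0) | (1u << 1),  /* readable | writeable bits */
>     };
>
>     if ( HYPERVISOR_iommu_op(IOMMUOP_map_page, &op, 1) || op.status )
>         /* fall back to bounce buffering for this frame */;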
>
> IOMMUOP_unmap_page
> ------------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array
Um, 'unsigned integer' count?
>
> This subop will attempt to unmap each element in the `struct iommu_map_op`
> array and record the unmapping status back into the array itself. If an
> unmapping fault occurs then the hypercall stops processing the array and
> returns with -EFAULT.
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
>
>     struct iommu_map_op {
>         uint64_t bfn;
>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
>
> --------------------------------------------------------------------
> Field Purpose
> ----- -----------------------------------------------------
> `bfn` [in] Bus address frame number to be unmapped
I presume this is gathered from the 'map' call?
>
> `mfn` [in] This field is ignored for unmap subop
>
> `flags` [in] This field is ignored for unmap subop
>
> `status` [out] Mapping status of this unmap operation, 0
>          indicates success
> --------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code Reason
> ---------- ------------------------------------------------------------
> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
EFAULT too
> ------------------------------------------------------------------------
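>
> The corresponding teardown reuses the same structure (mfn and flags are
> ignored, per the table above; HYPERVISOR_iommu_op is the same illustrative
> wrapper as before):
>
>     /* Sketch: remove the bus mapping established earlier. */
>     struct iommu_map_op op = { .bfn = gpfn };
>
>     if ( HYPERVISOR_iommu_op(IOMMUOP_unmap_page, &op, 1) || op.status )
>         /* mapping may still be live; do not recycle the frame for DMA */;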
>
>
> Conditions for which PV IOMMU hypercalls succeed
> ------------------------------------------------
> All of the following conditions are required to be true for PV IOMMU
> hypercalls to succeed (a sketch of the combined check follows the list):
>
> 1. IOMMU detected and supported by Xen
> 2. The following Xen IOMMU options are NOT enabled:
> dom0-passthrough, dom0-strict
> 3. Domain 0 is making the hypercall
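>
> A sketch of the combined check (iommu_enabled, iommu_passthrough,
> iommu_dom0_strict and is_hardware_domain() exist in Xen today; combining
> them this way is an assumption of this sketch):
>
>     /* Sketch: gate every PV IOMMU subop on the three conditions above. */
>     if ( !iommu_enabled ||                          /* condition 1 */
>          iommu_passthrough || iommu_dom0_strict ||  /* condition 2 */
>          !is_hardware_domain(current->domain) )     /* condition 3 */
>         return -EPERM;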
>
>
> Security Implications of allowing Domain 0 IOMMU control
> ========================================================
>
> Xen currently allows IO devices attached to Domain 0 to have direct access
> to all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
>
> The PV IOMMU feature provides the same level of access to the MFN address
> space, and the feature is not enabled when the Xen IOMMU option dom0-strict
> is enabled. Therefore security is not affected by the PV IOMMU feature.
>
>
>