From: Malcolm Crossley <malcolm.crossley@citrix.com>
To: xen-devel <xen-devel@lists.xenproject.org>
Subject: [RFC] Dom0 PV IOMMU control design (draft A)
Date: Fri, 11 Apr 2014 18:28:43 +0100
Message-ID: <5348264B.1040800@citrix.com>
Hi,
Here is a design for allowing Dom0 PV guests to control the IOMMU. This
allows the Dom0 GPFN mapping to be programmed into the IOMMU, avoiding the
use of the SWIOTLB bounce buffer technique in the Linux kernel (except for
legacy 32-bit DMA IO devices).
This feature provides two gains:
1. Improved performance for use cases which relied upon the bounce buffer,
e.g. NICs using jumbo frames with linear buffers.
2. Prevention of SWIOTLB bounce buffer region exhaustion, which can cause
unrecoverable Linux kernel driver errors.
A PDF version of the document is available here:
http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
The pandoc markdown format of the document is provided below to allow
for easier inline comments:
Introduction
============
Background
----------
Xen PV guests use a Guest Pseudo-physical Frame Number (GPFN) address space
which is decoupled from the host Machine Frame Number (MFN) address space.
PV guests which interact with hardware need to translate GPFN addresses to
MFN addresses because hardware uses the host address space only.

PV guest hardware drivers are only aware of the GPFN address space and
assume that if GPFN addresses are contiguous then the hardware addresses
will be contiguous as well. The decoupling between the GPFN and MFN address
spaces means GPFN and MFN addresses may not be contiguous across page
boundaries, and thus a buffer allocated in the GPFN address space which
spans a page boundary may not be contiguous in the MFN address space.
PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.

A bounce buffer region is a special part of the GPFN address space which has
been made to be contiguous in both the GPFN and MFN address spaces. When a
driver requests that a buffer which spans a page boundary be made available
for hardware to read, core operating system code copies the buffer into a
temporarily reserved part of the bounce buffer region and then returns the
MFN address of that reserved part back to the driver. The driver then
instructs the hardware to read the copy of the buffer in the bounce buffer.
Similarly, if the driver requests that a buffer be made available for
hardware to write to, a region of the bounce buffer is reserved first, and
after the hardware completes writing, the reserved region of the bounce
buffer is copied back to the originally allocated buffer.
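
As an illustration of this mechanism (not the actual Linux SWIOTLB code;
the helper names are hypothetical), a minimal sketch of the copy-in/copy-out
flow:

    /* Illustrative sketch only; not the Linux SWIOTLB implementation.
     * bounce_alloc()/bounce_free() are hypothetical helpers that reserve and
     * release a slot contiguous in both GPFN and MFN address space. */
    #include <stdint.h>
    #include <string.h>

    extern void *bounce_alloc(size_t len, uint64_t *mfn_addr_out);
    extern void bounce_free(void *slot);

    /* Buffer to be read by hardware: copy it into the bounce region and
     * return the machine address of the copy to the driver. */
    uint64_t bounce_map_to_device(const void *buf, size_t len, void **slot_out)
    {
        uint64_t mfn_addr;
        void *slot = bounce_alloc(len, &mfn_addr); /* may exhaust the region */

        memcpy(slot, buf, len);                    /* the costly extra copy */
        *slot_out = slot;
        return mfn_addr;                           /* handed to the hardware */
    }

    /* Buffer written by hardware: copy the bounce slot back afterwards. */
    void bounce_unmap_from_device(void *slot, void *buf, size_t len)
    {
        memcpy(buf, slot, len);                    /* second extra copy */
        bounce_free(slot);
    }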
The overhead of the memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk that the fixed-size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. Linux kernel drivers do not
tolerate this failure, and so the kernel is forced to crash, as an
uncorrectable error has occurred.
Input/Output Memory Management Units (IOMMUs) allow an inbound address
mapping to be created from the I/O bus address space (typically PCI) to the
machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and can therefore create mappings of
page-size granularity or larger.
Purpose
=======
Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
hardware devices that Domain 0 has access to. This enables Domain 0 to
program a bus address space mapping which matches its GPFN mapping. Once a
1-1 mapping of GPFN to bus address space is created, a bounce buffer region
is not required for the IO devices connected to the IOMMU.
Architecture
============
A three-argument hypercall interface (do_iommu_op), implementing two
hypercall subops.
Design considerations for hypercall subops
-------------------------------------------
IOMMU map/unmap operations can be slow and can involve flushing the IOMMU
TLB to ensure the IO device uses the updated mappings.

The subops have been designed to take an array of operations and a count as
parameters. This allows hypercall continuations to be implemented easily,
and allows batches of IOMMU operations to be submitted before the IOMMU TLB
is flushed.
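
As an illustration of this batching model, the sketch below shows how the
hypervisor side of a subop might walk the array, record per-element status,
create a continuation when preempted, and flush the IOMMU TLB only on
completion or continuation. The helper names (iommu_do_map, need_preempt,
create_continuation, iommu_tlb_flush) are placeholders, not existing Xen
interfaces.

    /* Sketch of the batching/continuation model; helper names are
     * placeholders rather than existing Xen interfaces. */
    static long do_iommu_map_batch(struct iommu_map_op *ops, unsigned int count)
    {
        unsigned int i;

        for ( i = 0; i < count; i++ )
        {
            /* The per-element result is written back into the array itself. */
            ops[i].status = iommu_do_map(ops[i].bfn, ops[i].mfn, ops[i].flags);

            if ( need_preempt() )
            {
                /* Flush before pausing so the device sees the updated
                 * mappings, then resume from the next element later. */
                iommu_tlb_flush();
                return create_continuation(ops + i + 1, count - i - 1);
            }
        }

        /* Single flush for the whole batch when the hypercall completes. */
        iommu_tlb_flush();
        return 0;
    }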
IOMMUOP_map_page
----------------
First argument, pointer to array of `struct iommu_map_op`
Second argument, integer count of `struct iommu_map_op` elements in array
This subop will attempt to IOMMU-map each element in the `struct
iommu_map_op` array and record the mapping status back into the array
itself. If a mapping fault occurs then the hypercall will return with
-EFAULT.

This subop will inspect the MFN address being mapped in each iommu_map_op
to ensure it does not belong to the Xen hypervisor itself. If the MFN does
belong to the Xen hypervisor, the subop will return -EPERM in the status
field for that particular iommu_map_op.

The IOMMU TLB will only be flushed when the hypercall completes or a
hypercall continuation is created.
    struct iommu_map_op {
        uint64_t bfn;
        uint64_t mfn;
        uint32_t flags;
        int32_t status;
    };
------------------------------------------------------------------------------
Field       Purpose
----------  ------------------------------------------------------------------
`bfn`       [in] Bus address frame number to be mapped to the specified mfn
`mfn`       [in] Machine address frame number
`flags`     [in] Flags for signalling the type of IOMMU mapping to be created
`status`    [out] Mapping status of this map operation, 0 indicates success
------------------------------------------------------------------------------
Defined bits for flags field:

------------------------------------------------------------------------
Name                       Bit      Definition
----                       -----    ----------------------------------
IOMMU_MAP_OP_readable      0        Create readable IOMMU mapping
IOMMU_MAP_OP_writeable     1        Create writeable IOMMU mapping
Reserved for future use    2-31     n/a
------------------------------------------------------------------------
Additional error codes specific to this hypercall:
Error code Reason
---------- ------------------------------------------------------------
EPERM PV IOMMU mode not enabled or calling domain is not domain 0
------------------------------------------------------------------------
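
As a guest-side usage sketch, Dom0 could batch map operations to build a 1-1
GPFN-to-bus-address mapping. HYPERVISOR_iommu_op() and gpfn_to_mfn() are
assumed wrapper names for illustration, and the subop is assumed to be
exposed as an IOMMUOP_map_page constant; none of these are existing
interfaces.

    /* Guest-side sketch; HYPERVISOR_iommu_op() and gpfn_to_mfn() are assumed
     * names, not existing kernel interfaces. */
    #define MAP_BATCH 128

    static int map_gpfn_range(uint64_t start_gpfn, unsigned int count)
    {
        struct iommu_map_op ops[MAP_BATCH];
        unsigned int i, n = count < MAP_BATCH ? count : MAP_BATCH;
        int rc;

        for ( i = 0; i < n; i++ )
        {
            ops[i].bfn    = start_gpfn + i;           /* bus frame == GPFN */
            ops[i].mfn    = gpfn_to_mfn(start_gpfn + i);
            ops[i].flags  = (1u << 0) | (1u << 1);    /* readable | writeable */
            ops[i].status = 0;
        }

        rc = HYPERVISOR_iommu_op(IOMMUOP_map_page, ops, n);
        if ( rc )
            return rc;                    /* e.g. -EFAULT on a mapping fault */

        for ( i = 0; i < n; i++ )
            if ( ops[i].status )
                return ops[i].status;     /* e.g. -EPERM for a Xen-owned MFN */

        return 0;
    }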
IOMMUOP_unmap_page
------------------
First argument, pointer to array of `struct iommu_map_op`
Second argument, integer count of `struct iommu_map_op` elements in array
This subop will attempt to unmap each element in the `struct iommu_map_op`
array and record the unmapping status back into the array itself. If an
unmapping fault occurs then the hypercall will stop processing the array and
return with -EFAULT.

The IOMMU TLB will only be flushed when the hypercall completes or a
hypercall continuation is created.
    struct iommu_map_op {
        uint64_t bfn;
        uint64_t mfn;
        uint32_t flags;
        int32_t status;
    };
--------------------------------------------------------------------
Field       Purpose
----------  --------------------------------------------------------
`bfn`       [in] Bus address frame number to be unmapped
`mfn`       [in] This field is ignored for the unmap subop
`flags`     [in] This field is ignored for the unmap subop
`status`    [out] Mapping status of this unmap operation, 0 indicates success
--------------------------------------------------------------------
Additional error codes specific to this hypercall:
Error code Reason
---------- ------------------------------------------------------------
EPERM PV IOMMU mode not enabled or calling domain is not domain 0
------------------------------------------------------------------------
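
A corresponding guest-side unmap sketch, again using the assumed
HYPERVISOR_iommu_op() wrapper; only the bfn field is consulted for this
subop:

    /* Guest-side sketch; HYPERVISOR_iommu_op() is an assumed wrapper name. */
    static int unmap_bfn_range(uint64_t start_bfn, unsigned int count)
    {
        struct iommu_map_op ops[128];
        unsigned int i, n = count < 128 ? count : 128;

        for ( i = 0; i < n; i++ )
        {
            ops[i].bfn    = start_bfn + i;   /* only bfn is used for unmap */
            ops[i].mfn    = 0;
            ops[i].flags  = 0;
            ops[i].status = 0;
        }

        /* Returns -EFAULT and stops processing on an unmapping fault;
         * per-element results are reported in ops[i].status. */
        return HYPERVISOR_iommu_op(IOMMUOP_unmap_page, ops, n);
    }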
Conditions for which PV IOMMU hypercalls succeed
------------------------------------------------
All of the following conditions are required to be true for PV IOMMU
hypercalls to succeed (a minimal sketch of the corresponding check is given
after this list):

1. IOMMU detected and supported by Xen
2. The following Xen IOMMU options are NOT enabled: dom0-passthrough,
   dom0-strict
3. Domain 0 is making the hypercall
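
A minimal sketch of that check on the hypervisor side; the option flags and
helper shown are placeholders for however the IOMMU settings are represented
inside Xen:

    /* Sketch of the precondition check; flag and helper names are
     * placeholders, not existing Xen symbols. */
    static bool pv_iommu_allowed(const struct domain *d)
    {
        if ( !iommu_detected_and_supported() )          /* condition 1 */
            return false;
        if ( opt_dom0_passthrough || opt_dom0_strict )  /* condition 2 */
            return false;
        return d->domain_id == 0;                       /* condition 3 */
    }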
Security Implications of allowing Domain 0 IOMMU control
========================================================
Xen currently allows IO devices attached to Domain 0 to have direct access
to all of the MFN address space (except Xen hypervisor memory regions),
provided the Xen IOMMU option dom0-strict is not enabled.

The PV IOMMU feature provides the same level of access to the MFN address
space, and the feature is not enabled when the Xen IOMMU option dom0-strict
is enabled. Therefore security is not affected by the PV IOMMU feature.