From: Malcolm Crossley <malcolm.crossley@citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [RFC] Dom0 PV IOMMU control design (draft A)
Date: Mon, 14 Apr 2014 16:03:22 +0100
Message-ID: <534BF8BA.2040601@citrix.com>
In-Reply-To: <20140414125157.GB22384@phenom.dumpdata.com>
On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
>> On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
>>>> Hi,
>>>>
>>>> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> With the device driver domains I think you should also rename the
> 'dom0' to device driver or 'hardware domain' - as this functionality
> should be possible within an PV guest with PCI passthrough for example.
Currently Xen only allows Dom0 IOMMU access to all MFNs (except Xen's own).
To avoid changing the current security implications of the feature, I would
prefer that 'hardware domain' support be added as a separate design.
>
>>>> This allows the Dom0 GPFN mapping to be programmed into the
>>>> IOMMU and avoids using the SWIOTLB bounce buffer technique in the
>>>> Linux kernel (except for legacy 32 bit DMA IO devices)
>>>>
>>>> ...
>>>>
>>>> Design considerations for hypercall subops
>>>> -------------------------------------------
>>>> IOMMU map/unmap operations can be slow and can involve flushing the
>>>> IOMMU TLB to ensure the IO device uses the updated mappings.
>>>>
>>>> The subops have been designed to take an array of operations and a count
>>>> as parameters. This allows for easily implemented hypercall continuations
>>>> to be used and allows for batches of IOMMU operations to be submitted
>>>> before flushing the IOMMU TLB.
>>>>
>>>>
>>>>
>>>> IOMMUOP_map_page
>>>> ----------------
>>>> First argument, pointer to array of `struct iommu_map_op`
>>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>> Could this be 'unsigned integer' count?
>> Yes will change for draft B
>>> Is there a limit? Can I do 31415 of them? Can I do it for the whole
>>> memory space of the guest?
>> There is no current limit; the hypercall will be implemented with
>> continuations to prevent denial of service attacks.
>>>> This subop will attempt to IOMMU map each element in the `struct
>>>> iommu_map_op` array and record the mapping status back into the array
>>>> itself. If a mapping fault occurs then the hypercall will return with
>>>> -EFAULT.
>>>> This subop will inspect the MFN address being mapped in each
>>>> iommu_map_op to ensure it does not belong to the Xen hypervisor itself.
>>>> If the MFN does belong to the Xen hypervisor the subop will return
>>>> -EPERM in the status field for that particular iommu_map_op.
>>> Is it OK if the MFN belongs to another guest?
>>>
>> It is OK for the MFN to belong to another guest because only Dom0 is
>> performing the mapping. This is to allow grant mapped pages to have
>> a Dom0 BFN mapping.
> That should be reflected in the spec: that the MFN has to pass the
> grant check.
Again, if the design is restricted to domain 0 only, then there is no
need for a grant check, because domain 0 previously had a bus-addressable
mapping for all MFNs.
>>>> The IOMMU TLB will only be flushed when the hypercall completes or a
>>>> hypercall continuation is created.
>>>>
>>>> struct iommu_map_op {
>>>> uint64_t bfn;
>>> bus_frame ?
>> Yes, or you could say, Bus Address with 4k page size granularity.
>
> The 'bfn' sounds like a new word. I was hoping you could use something
> that is more in line with the nomenclature in the I/O world. Such as
> 'bus addresses', 'bus physical frame', etc. The 'bfn' sounds just
> off. The best I could come up with was 'bus_frame' but there has to
> be something that sounds better?
I agree that bfn is not ideal. Using bus_address is dangerous because
the mapping can only be done at 4k granularity (MFNs are also 4k
granular).
"iofn" is an alternative but potentially misleading, because IOMMUs may
not be able to translate all IO devices accessible to dom0
(particularly on ARM).
>
>>>> uint64_t mfn;
>>>> uint32_t flags;
>>>> int32_t status;
>>>> };
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Field Purpose
>>>> ----- ---------------------------------------------------------------
>>>> `bfn` [in] Bus address frame number to be mapped to the specified
>>>> mfn below
>>> Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
>>> How would dom0 know the bus address? That usually is something only the
>>> IOMMU knows.
>> The idea is that Domain0 sets up BFN to MFN mappings which are the
>> same as the GPFN to MFN mappings. This cannot be done from Xen's M2P
>> mappings because they are not up to date for PV guests.
> OK, but if GPFN == MFN, then this is not a bus frame number. You might
> as well call it 'gmfn' ?
The idea is not to force particular mappings on domain 0; the point I
was trying to make above is that domain 0 is responsible for tracking
the bfn mappings it has set up.
>>>> `mfn` [in] Machine address frame number
>>>>
>>> We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
>>> that in the PCIe devices right?
>> Yes, a GPFN to MFN lookup will be required but this is simply
>> consulting the Guest P2M. BTW, we are programming the IOMMU, not the
>> PCIe devices themselves.
> We do have to program some value in the PCIe device so that it can
> do the DMA operation. Said value right now is the MFN << PAGE_SHIFT (aka
> physical address). But as the spec says - the 'physical address' might
> not be equal to the 'bus address'. Hence the value we would be programming
> in might be very well different than MFN.
Yes, but domain 0 will track the BFN to MFN mappings it has set up (if
it sets up BFN == GPFN then this is easy).
So domain 0 is responsible for programming the correct BFN address into
the PCIe device.
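
To make that concrete, here is a rough sketch (purely illustrative, the
helper name is made up) of how a dom0 driver could derive the value to
program into a 64-bit capable device once it has established BFN == GPFN
mappings via IOMMUOP_map_page:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    /* Illustrative only: with BFN == GPFN the bus address programmed
     * into the PCIe device is just the guest pseudo-physical address,
     * so no SWIOTLB bounce buffering is needed. */
    static uint64_t gpfn_to_bus_addr(uint64_t gpfn, uint64_t offset_in_page)
    {
        return (gpfn << PAGE_SHIFT) | (offset_in_page & (PAGE_SIZE - 1));
    }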
>
> Or in other words - the IOMMU could be non-coherent and it would use
> its own addresses for physical addresses. As in, different than
> the physical ones.
I don't understand what you are trying to say here. When you say
physical address, do you mean MFN (memory-based addresses) or BFN (bus
addresses from the device's point of view)?
>>>
>>>> `flags` [in] Flags for signalling type of IOMMU mapping to be created
>>>>
>>>> `status` [out] Mapping status of this map operation, 0
>>>> indicates success
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>> Defined bits for flags field
>>>> ------------------------------------------------------------------------
>>>> Name Bit Definition
>>>> ---- ----- ----------------------------------
>>>> IOMMU_MAP_OP_readable 0 Create readable IOMMU mapping
>>>> IOMMU_MAP_OP_writeable 1 Create writeable IOMMU mapping
>>> And is it OK to use both?
>> Yes and typically both would be used. I'm just allowing for read
>> only mappings and write only mappings to be created.
> You need to mention that in the spec that it is OK to combine them.
> Or mention which ones are not OK to combine.
Different IOMMU implementations may differ in what types of mappings can
be set up. I would expect the guest to handle the error if a particular
type of mapping it requires cannot be set up.
>>>> Reserved for future use 2-31 n/a
>>>> ------------------------------------------------------------------------
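
To illustrate the batching and the combination of the two flag bits, the
sketch below builds an array of operations and submits them in a single
hypercall, so the IOMMU TLB only needs to be flushed once for the whole
batch. The pv_iommu_op() wrapper and the subop number are placeholders,
not part of this draft:

    #include <stdint.h>

    #define IOMMU_MAP_OP_readable  (1u << 0)   /* bit 0 above */
    #define IOMMU_MAP_OP_writeable (1u << 1)   /* bit 1 above */

    struct iommu_map_op {
        uint64_t bfn;
        uint64_t mfn;
        uint32_t flags;
        int32_t  status;
    };

    /* Hypothetical hypercall wrapper; not a real interface. */
    extern int pv_iommu_op(unsigned int subop, struct iommu_map_op *ops,
                           unsigned int count);
    #define IOMMUOP_map_page 1   /* placeholder subop number */

    /* Map count frames read-write with BFN == GPFN in one batch. */
    static int map_batch_rw(const uint64_t *gpfns, const uint64_t *mfns,
                            struct iommu_map_op *ops, unsigned int count)
    {
        for (unsigned int i = 0; i < count; i++) {
            ops[i].bfn    = gpfns[i];
            ops[i].mfn    = mfns[i];
            ops[i].flags  = IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable;
            ops[i].status = 0;
        }
        return pv_iommu_op(IOMMUOP_map_page, ops, count);
    }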
>>>>
>>>> Additional error codes specific to this hypercall:
>>>>
>>>> Error code Reason
>>>> ---------- ------------------------------------------------------------
>>>> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
>>> And -EFAULT
>> I was considering that EFAULT would be a standard error code; do you
>> want it to be explicitly listed?
> Yes. This is a spec and if you don't spell it out folks might miss this.
>
>>> and what about success? Do you return 0 or the number of ops that were
>>> successfull?
>> Return 0 if all ops were successful; otherwise return the number of
>> failed operations as a positive number. I leave it up to the caller to
>> determine which operations failed by iterating the input array. I
>> will add this to draft B.
> Thank you.
>>>> ------------------------------------------------------------------------
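
Building on the declarations in the previous sketch, the caller-side
handling of the proposed return convention (0 when every element
succeeded, otherwise a positive count of failed elements) could look
like this:

    /* Returns 0 on success, or the first per-element error otherwise.
     * A positive hypercall return is the number of failed elements;
     * their individual status fields hold the per-element error code. */
    static int map_batch_checked(struct iommu_map_op *ops, unsigned int count)
    {
        int rc = pv_iommu_op(IOMMUOP_map_page, ops, count);

        if (rc > 0) {
            for (unsigned int i = 0; i < count; i++) {
                if (ops[i].status != 0)
                    return ops[i].status; /* e.g. -EPERM for a Xen-owned MFN */
            }
        }
        return rc;   /* 0, or a negative error such as -EFAULT */
    }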
>>>>
>>>> IOMMUOP_unmap_page
>>>> ------------------
>>>> First argument, pointer to array of `struct iommu_map_op`
>>>> Second argument, integer count of `struct iommu_map_op` elements in array
>>> Um, 'unsigned integer' count?
>> Yes will change for draft B
>>>> This subop will attempt to unmap each element in the `struct
>>>> iommu_map_op` array and record the mapping status back into the array
>>>> itself. If an unmapping fault occurs then the hypercall stops processing
>>>> the array and returns with -EFAULT.
>>>>
>>>> The IOMMU TLB will only be flushed when the hypercall completes or a
>>>> hypercall continuation is created.
>>>>
>>>> struct iommu_map_op {
>>>> uint64_t bfn;
>>>> uint64_t mfn;
>>>> uint32_t flags;
>>>> int32_t status;
>>>> };
>>>>
>>>> --------------------------------------------------------------------
>>>> Field Purpose
>>>> ----- -----------------------------------------------------
>>>> `bfn` [in] Bus address frame number to be unmapped
>>> I presume this is gathered from the 'map' call?
>> It does not need to be gathered; Domain 0 is responsible for its
>> own BFN mappings and there is no auditing of the BFN addresses
>> themselves.
> I can see this be the case when BFN == MFN. But if that is not the
> case I cannot see how the domain would gather the BFNs - unless it
> gets it from the 'map' call.
Domain 0 issues the original PV IOMMU map hypercall to cause the BFN
mapping to be created, so it should be able to track what needs to be
unmapped.
>
>>>> `mfn` [in] This field is ignored for unmap subop
>>>>
>>>> `flags` [in] This field is ignored for unmap subop
>>>>
>>>> `status` [out] Mapping status of this unmap operation, 0
>>>> indicates success
>>>> --------------------------------------------------------------------
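
To illustrate the tracking responsibility, dom0 only needs to remember
the bfns it mapped and replay them through IOMMUOP_unmap_page; the mfn
and flags fields are ignored for the unmap subop as noted above. The
wrapper and struct are reused from the earlier sketch, and the subop
number is again a placeholder:

    #define IOMMUOP_unmap_page 2   /* placeholder subop number */

    /* Unmap a batch of previously mapped frames.  Only bfn is consumed
     * by the unmap subop; mfn and flags are ignored. */
    static int unmap_batch(const uint64_t *bfns, struct iommu_map_op *ops,
                           unsigned int count)
    {
        for (unsigned int i = 0; i < count; i++) {
            ops[i].bfn    = bfns[i];
            ops[i].mfn    = 0;
            ops[i].flags  = 0;
            ops[i].status = 0;
        }
        return pv_iommu_op(IOMMUOP_unmap_page, ops, count);
    }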
>>>>
>>>> Additional error codes specific to this hypercall:
>>>>
>>>> Error code Reason
>>>> ---------- ------------------------------------------------------------
>>>> EPERM PV IOMMU mode not enabled or calling domain is not domain 0
>>> EFAULT too
>> I was considering that EFAULT would be a standard error code; do you
>> want it to be explicitly listed?
> Yes.
>>