Re: [RFC] Dom0 PV IOMMU control design (draft A)

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [RFC] Dom0 PV IOMMU control design (draft A)
Date: Mon, 14 Apr 2014 11:09:54 -0400	[thread overview]
Message-ID: <20140414150954.GF23371@phenom.dumpdata.com> (raw)
In-Reply-To: <534BF8BA.2040601@citrix.com>

On Mon, Apr 14, 2014 at 04:03:22PM +0100, Malcolm Crossley wrote:
> On 14/04/14 13:51, Konrad Rzeszutek Wilk wrote:
> >On Mon, Apr 14, 2014 at 01:12:07PM +0100, Malcolm Crossley wrote:
> >>On 11/04/14 18:50, Konrad Rzeszutek Wilk wrote:
> >>>On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> >>>>Hi,
> >>>>
> >>>>Here is a design for allowing Dom0 PV guests to control the IOMMU.
> >With the device driver domains I think you should also rename the
> >'dom0' to device driver or 'hardware domain' - as this functionality
> >should be possible within an PV guest with PCI passthrough for example.
> Currently Xen only allows Dom0 IOMMU access to all (expect Xen)
> MFN's. To not change the current security implications of the
> feature I would prefer that 'hardware domain' support was added as a
> separate design.
> >
> >>>>This allows for the Dom0 GPFN mapping to be programmed into the
> >>>>IOMMU and avoid using the SWIOTLB bounce buffer technique in the
> >>>>Linux kernel (except for legacy 32 bit DMA IO devices)
> >>>>
> >>>>...
> >>>>
> >>>>Design considerations for hypercall subops
> >>>>-------------------------------------------
> >>>>IOMMU map/unmap operations can be slow and can involve flushing the
> >>>>IOMMU TLB
> >>>>to ensure the IO device uses the updated mappings.
> >>>>
> >>>>The subops have been designed to take an array of operations and a count as
> >>>>parameters. This allows for easily implemented hypercall
> >>>>continuations to be
> >>>>used and allows for batches of IOMMU operations to be submitted
> >>>>before flushing
> >>>>the IOMMU TLB.
> >>>>
> >>>>
> >>>>
> >>>>IOMMUOP_map_page
> >>>>----------------
> >>>>First argument, pointer to array of `struct iommu_map_op`
> >>>>Second argument, integer count of `struct iommu_map_op` elements in array
> >>>Could this be 'unsigned integer' count?
> >>Yes will change for draft B
> >>>Is there a limit? Can I do 31415 of them? Can I do it for the whole
> >>>memory space of the guest?
> >>There is no current limit, the hypercall will be implemented with
> >>continuations to prevent denial of service attacks.
> >>>>This subop will attempt to IOMMU map each element in the `struct
> >>>>iommu_map_op`
> >>>>array and record the mapping status back into the array itself. If
> >>>>an mapping
> >>>>fault occurs then the hypercall will return with -EFAULT.
> >>>>This subop will inspect the MFN address being mapped in each
> >>>>iommu_map_op to
> >>>>ensure it does not belong to the Xen hypervisor itself. If the MFN
> >>>>does belong
> >>>>to the Xen hypervisor the subop will return -EPERM in the status
> >>>>field for that
> >>>>particular iommu_map_op.
> >>>Is it OK if the MFN belongs to another guest?
> >>>
> >>It is OK for the MFN to belong to another guest because only Dom0 is
> >>performing the mapping. This is to allow grant mapped pages to have
> >>a Dom0 BFN mapping.
> >That should be reflected in the spec. That the MFN have to pass the
> >grant check.
> Again, if the design is restricted to domain 0 only then there is no
> need for a grant check because previously domain 0 had a bus
> addressable mapping for all MFN's.
> >>>>The IOMMU TLB will only be flushed when the hypercall completes or a
> >>>>hypercall
> >>>>continuation is created.
> >>>>
> >>>>     struct iommu_map_op {
> >>>>         uint64_t bfn;
> >>>bus_frame ?
> >>Yes, or you could say, Bus Address with 4k page size granularity.
> >
> >The 'bfn' sounds like a new word. I was hoping you could use something
> >that is more in line with the nomenclature in the I/O world. Such as
> >'bus addresses', 'bus physical frame', etc. The 'bfn' sounds just
> >off. The best I could come up with was 'bus_frame' but there has to
> >be something that sounds better?
> I agree that bfn is not ideal. Using bus_address is dangerous
> because the mapping can only be done at 4k granularity (MFN's are 4k
> granularity also).
> "iofn" is an alternative but potentially misleading because IOMMU's
> may not be able to translate all IO device's accessible to dom0
> (particularly on ARM).

Who does translate the rest of those I/O devices? Other IOMMUs?

> >
> >>>>         uint64_t mfn;
> >>>>         uint32_t flags;
> >>>>         int32_t status;
> >>>>     };
> >>>>
> >>>>------------------------------------------------------------------------------
> >>>>Field          Purpose
> >>>>----- ---------------------------------------------------------------
> >>>>`bfn`          [in] Bus address frame number to mapped to specified
> >>>>mfn below
> >>>Huh? Isn't this out? If not, isn't bfn == mfn for dom0?
> >>>How would dom0 know the bus address? That usually is something only the
> >>>IOMMU knows.
> >>The idea is that Domain0 sets up BFN to MFN mappings which are the
> >>same as the GPFN to MFN mappings. This cannot be done from Xen's M2P
> >>mappings because they are not up to date for PV guests.
> >OK, but if GPFN == MFN, then this is not an bus frame number. You might
> >as well call it 'gmfn' ?
> The idea is not to force the particular mappings that domain 0 is
> allowed to create but the point I was trying to make above is that
> domain 0 is responsible for tracking the bfn mappings it has setup.

OK, just mention that in the spec :-)

> >>>>`mfn`          [in] Machine address frame number
> >>>>
> >>>We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program
> >>>that in the PCIe devices right?
> >>Yes, a GPFN to MFN lookup will be required but this is simply
> >>consulting the Guest P2M. BTW, we are programming the IOMMU not the
> >>PCIe device's themselves.
> >We do have to program some value in the PCIe device so that it can
> >do the DMA operation. Said value right now is the MFN << PAGE_SHIFT (aka
> >physical address). But as the spec says - the 'physical address' might
> >not be equal to the 'bus address'. Hence the value we would be programming
> >in might be very well different than MFN.
> Yes but domain 0 will track the BFN to MFN mappings it has setup (
> if it setup BFN= GPFN then this is easy).
> So domain 0 is responsible to programming the correct BFN address
> into the PCIe device.
> >
> >Or in other words - the IOMMU could be non-coherent and it would use
> >its own addresses for physical addresses. As in, different than
> >the physical ones.
> I don't understand what you are trying to say here? When you say
> physical address do you mean MFN (memory based addresses) or BFN
> (bus addresses from the device point of view)?

physical address = CPU viewpoint, aka MFN.
bus addresses = IOMMU or I/O viewpoint, aka BFN.

And BFN != MFN.

> >>>
> >>>>`flags`        [in] Flags for signalling type of IOMMU mapping to be created
> >>>>
> >>>>`status`       [out] Mapping status of this map operation, 0
> >>>>indicates success
> >>>>------------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>Defined bits for flags field
> >>>>------------------------------------------------------------------------
> >>>>Name                        Bit                Definition
> >>>>----                       ----- ----------------------------------
> >>>>IOMMU_MAP_OP_readable        0        Create readable IOMMU mapping
> >>>>IOMMU_MAP_OP_writeable       1        Create writeable IOMMU mapping
> >>>And is it OK to use both?
> >>Yes and typically both would be used. I'm just allowing for read
> >>only mappings and write only mappings to be created.
> >You need to mention that in the spec that it is OK to combine them.
> >Or mention which ones are not OK to combine.
> Different IOMMU implementations may change what type of mappings can
> be setup. I would expect the guest to handle the error if particular
> type of mapping can't be setup which the guest requires.

Ok, that needs to be mentioned in the spec too 
> >>>>Reserved for future use     2-31                   n/a
> >>>>------------------------------------------------------------------------
> >>>>
> >>>>Additional error codes specific to this hypercall:
> >>>>
> >>>>Error code  Reason
> >>>>---------- ------------------------------------------------------------
> >>>>EPERM       PV IOMMU mode not enabled or calling domain is not domain 0
> >>>And -EFAULT
> >>I was considering that EFAULT would be standard error code, do you
> >>want it to be explicitly listed?
> >Yes. This is a spec and if you don't spell it out folks might miss this.
> >
> >>>and what about success? Do you return 0 or the number of ops that were
> >>>successfull?
> >>Return 0 if all ops were successful, otherwise return the number of
> >>failed operations as positive number. I leave it up to the caller to
> >>determine which operations failed by iterating the input array. I'll
> >>will add this to draft B.
> >Thank you.
> >>>>------------------------------------------------------------------------
> >>>>
> >>>>IOMMUOP_unmap_page
> >>>>----------------
> >>>>First argument, pointer to array of `struct iommu_map_op`
> >>>>Second argument, integer count of `struct iommu_map_op` elements in array
> >>>Um, 'unsigned integer' count?
> >>Yes will change for draft B
> >>>>This subop will attempt to unmap each element in the `struct
> >>>>iommu_map_op` array
> >>>>and record the mapping status back into the array itself. If an
> >>>>unmapping fault
> >>>>occurs then the hypercall stop processing the array and return with
> >>>>an EFAULT;
> >>>>
> >>>>The IOMMU TLB will only be flushed when the hypercall completes or a
> >>>>hypercall
> >>>>continuation is created.
> >>>>
> >>>>     struct iommu_map_op {
> >>>>         uint64_t bfn;
> >>>>         uint64_t mfn;
> >>>>         uint32_t flags;
> >>>>         int32_t status;
> >>>>     };
> >>>>
> >>>>--------------------------------------------------------------------
> >>>>Field          Purpose
> >>>>-----          -----------------------------------------------------
> >>>>`bfn`          [in] Bus address frame number to be unmapped
> >>>I presume this is gathered from the 'map' call?
> >>It does not need to be gathered, Domain 0 is responsible for it's
> >>own BFN mappings and there is no auditing of the BFN address
> >>themselves.
> >I can see this be the case when BFN == MFN. But if that is not the
> >case I cannot see how the domain would gather the BFNs - unless it
> >gets it from the 'map' call.
> Domain 0 issue's the original PV IOMMU map hypercall to cause the
> BFN mapping to be created so it should be able to track what needs
> to be unmapped.

Earlier you stated: "It does not need to be gathered' but it does -
via the 'map' call which  you confirmed it here. Please just
note it in the spec that these values should used from prior
'map' calls.


> >
> >>>>`mfn`          [in] This field is ignored for unmap subop
> >>>>
> >>>>`flags`        [in] This field is ignored for unmap subop
> >>>>
> >>>>`status`       [out] Mapping status of this unmap operation, 0
> >>>>indicates success
> >>>>--------------------------------------------------------------------
> >>>>
> >>>>Additional error codes specific to this hypercall:
> >>>>
> >>>>Error code  Reason
> >>>>---------- ------------------------------------------------------------
> >>>>EPERM       PV IOMMU mode not enabled or calling domain is not domain 0
> >>>EFAULT too
> >>I was considering that EFAULT would be standard error code, do you
> >>want it to be explicitly listed?
> >Yes.
> >>
>

next prev parent reply	other threads:[~2014-04-14 15:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-11 17:28 [RFC] Dom0 PV IOMMU control design (draft A) Malcolm Crossley
2014-04-11 17:50 ` Konrad Rzeszutek Wilk
2014-04-14 12:12   ` Malcolm Crossley
2014-04-14 12:51     ` Konrad Rzeszutek Wilk
2014-04-14 15:03       ` Malcolm Crossley
2014-04-14 15:09         ` Konrad Rzeszutek Wilk [this message]
2014-04-14 15:48         ` Jan Beulich
2014-04-14 16:38           ` Malcolm Crossley
2014-04-15  6:50             ` Jan Beulich
2014-04-14 11:52 ` David Vrabel
2014-04-14 13:47 ` Jan Beulich
2014-04-14 15:48   ` Malcolm Crossley
2014-04-14 16:00     ` Jan Beulich
2014-04-14 16:55       ` Malcolm Crossley
2014-04-15  6:56         ` Jan Beulich
2014-05-01 11:56     ` Tim Deegan
2014-04-16 14:13 ` Zhang, Xiantao
2014-04-16 15:35   ` Malcolm Crossley

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140414150954.GF23371@phenom.dumpdata.com \
    --to=konrad.wilk@oracle.com \
    --cc=malcolm.crossley@citrix.com \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).