All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, tianyu.lan@intel.com,
	kevin.tian@intel.com, mst@redhat.com, jan.kiszka@siemens.com,
	jasowang@redhat.com, alex.williamson@redhat.com,
	bd.aviv@gmail.com
Subject: Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
Date: Thu, 22 Dec 2016 09:56:01 +1100	[thread overview]
Message-ID: <20161221225601.GD14282@umbus.fritz.box> (raw)
In-Reply-To: <20161221100549.GG22006@pxdev.xzpeter.org>

[-- Attachment #1: Type: text/plain, Size: 7081 bytes --]

On Wed, Dec 21, 2016 at 06:05:49PM +0800, Peter Xu wrote:
> On Wed, Dec 21, 2016 at 01:53:37PM +1100, David Gibson wrote:
> 
> [...]
> 
> > > Could you explain why here device address space has things to do with
> > > PCI BARs? I thought BARs are for CPU address space only (so that CPU
> > > can access PCI registers via MMIO manner), am I wrong?
> > 
> > In short, yes.  So, first think about vanilla PCI - most things are
> > PCI-E these days, but the PCI addressing model which was designed for
> > the old hardware is still mostly the same.
> > 
> > With plain PCI, you have a physical bus over which address and data
> > cycles pass.  Those cycles don't distinguish between transfers from
> > host to device or device to host.  Each address cycle just gives which
> > address space: configuration, IO or memory, and an address.
> > 
> > Devices respond to addresses within their BARs, typically such cycles
> > will come from the host, but they don't have to - a device is able to
> > send cycles to another device (peer to peer DMA).  Meanwhile the host
> > bridge will respond to addresses within certain DMA windows,
> > propagating those access onwards to system memory.  How many DMA
> > windows there are, their size, location and whether they're mapped
> > directly or via an IOMMU depends on the model of host bridge.
> > 
> > On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped
> > directly to memory addresses 0..<somewhere>, identity mapping RAM into
> > PCI space.  BARs would be assigned above <somewhere>, so they don't
> > collide.  I suspect old enough machines will have <somewhere> == 2G,
> > leaving 2G..4G for the BARs of 32-bit devices.  More modern x86
> > bridges must have provisions for accessing memory above 4G, but I'm
> > not entirely certain how that works.
> > 
> > PAPR traditionally also had a DMA window from 0..2G, however instead
> > of being direct mapped to RAM, it is always translated via an IOMMU.
> > More modern PAPR systems have that window by default, but allow the
> > OS to remove it and configure up to 2 DMA windows of variable length
> > and page size.  Various other platforms have various other DMA window
> > arrangements.
> > 
> > With PCI-E, of course, upstream and downstream cycles are distinct,
> > and peer to peer DMA isn't usually possible (unless a switch is
> > configured specially to allow it by forwarding cycles from one
> > downstream port to another).  But the address mndel remains logically
> > the same: there is just one PCI memory space and both device BARs and
> > host DMA windows live within it.  Firmware and/or the OS need to know
> > the details of the platform's host bridge, and configure both the BARs
> > and the DMA windows so that they don't collide.
> 
> Thanks for the thorough explanation. :)
> 
> So we should mask out all the MMIO regions (including BAR address
> ranges) for PCI device address space, right?

Uhh.. I think "masking out" is treating the problem backwards.
I think you should allow only a window covering RAM, not take
everything then try to remove MMIO.

> Since they should not be
> able to access such addresses, but system ram?

What is "they"?

> > > I think we should have a big enough IOMMU region size here. If device
> > > writes to invalid addresses, IMHO we should trap it and report to
> > > guest. If we have a smaller size than UINT64_MAX, how we can trap this
> > > behavior and report for the whole address space (it should cover [0,
> > > 2^64-1])?
> > 
> > That's not how the IOMMU works.  How it traps is dependent on the
> > specific IOMMU model, but generally they'll only even look at cycles
> > which lie within the IOMMU's DMA window.  On x86 I'm pretty sure that
> > window will be large, but it won't be 2^64.  It's also likely to have
> > a gap between 2..4GiB to allow room for the BARs of 32-bit devices.
> 
> But for x86 IOMMU region, I don't know anything like "DMA window" -
> device has its own context entry, which will point to a whole page
> table. In that sense I think at least all addresses from (0, 2^39-1)
> should be legal addresses? And that range should depend on how many
> address space bits the specific Intel IOMMU support, currently the
> emulated VT-d one supports 39 bits.

Well, sounds like the DMA window is 0..2^39-1 then.  If there's a
maximum number of bits in the specification - even if those aren't
implemented on any current model - that would also be a reasonable
choice.

Again, I strongly suspect there's a gap in the range 2..4GiB, to allow
for 32-bit BARs.  Maybe not the whole range, but some chunk of it.  I
believe that's what's called the "io hole" on x86.

This more or less has to be there.  If all of 0..4GiB was mapped to
RAM, and you had both a DMA capable device and a 32-bit device hanging
off a PCI-E to PCI bridge, the 32-bit device's BARs could pick up
cycles from the DMA device that were intended to go to RAM.

> An example would be: one with VT-d should be able to map the address
> 3G (0xc0000000, here it is an IOVA address) to any physical address
> he/she wants, as long as he/she setup the page table correctly.
> 
> Hope I didn't miss anything important..
> 
> > 
> > > > 
> > > > > +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > > > > +                                 "vtd_sys_alias", get_system_memory(),
> > > > > +                                 0, memory_region_size(get_system_memory()));
> > > > 
> > > > I strongly suspect using memory_region_size(get_system_memory()) is
> > > > also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> > > > you you can't actually access all of that via PCI space (again, it
> > > > would collide with actual PCI BARs).  I also suspect you can't reach
> > > > CPU MMIO regions via the PCI DMA space.
> > > 
> > > Hmm, sounds correct.
> > > 
> > > However if so we will have the same problem if without IOMMU? See
> > > pci_device_iommu_address_space() - address_space_memory will be the
> > > default if we have no IOMMU protection, and that will cover e.g. CPU
> > > MMIO regions as well.
> > 
> > True.  That default is basically assuming that both the host bridge's
> > DMA windows, and its outbound IO and memory windows are identity
> > mapped between the system bus and the PCI address space.  I suspect
> > that's rarely 100% true, but it's close enough to work on a fair few
> > platforms.
> > 
> > But since you're building a more accurate model of the x86 host
> > bridge's behaviour here, you might as well try to get it as accurate
> > as possible.
> 
> Yes, but even if we can fix this problem, it should be for the
> no-iommu case as well? If so, I think it might be more suitable for
> another standalone patch.

Hm, perhaps.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

  reply	other threads:[~2016-12-21 22:56 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-19 14:41 [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
2016-12-19 16:56 ` Alex Williamson
2016-12-20  3:44   ` Peter Xu
2016-12-20  4:52     ` Alex Williamson
2016-12-20  6:38       ` Peter Xu
2016-12-21  0:04         ` Alex Williamson
2016-12-21  3:19           ` Peter Xu
2016-12-21  3:49           ` David Gibson
2016-12-21  3:30       ` David Gibson
2016-12-19 23:30 ` David Gibson
2016-12-20  4:16   ` Peter Xu
2016-12-21  2:53     ` David Gibson
2016-12-21 10:05       ` Peter Xu
2016-12-21 22:56         ` David Gibson [this message]
2016-12-20 23:02 ` no-reply
2016-12-21  3:33   ` Peter Xu
2016-12-20 23:57 ` no-reply
2016-12-21  3:39   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161221225601.GD14282@umbus.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=alex.williamson@redhat.com \
    --cc=bd.aviv@gmail.com \
    --cc=jan.kiszka@siemens.com \
    --cc=jasowang@redhat.com \
    --cc=kevin.tian@intel.com \
    --cc=mst@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=tianyu.lan@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.