From: Peter Xu <peterx@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: qemu-devel@nongnu.org, tianyu.lan@intel.com,
	kevin.tian@intel.com, mst@redhat.com, jan.kiszka@siemens.com,
	jasowang@redhat.com, alex.williamson@redhat.com,
	bd.aviv@gmail.com
Subject: Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
Date: Wed, 21 Dec 2016 18:05:49 +0800	[thread overview]
Message-ID: <20161221100549.GG22006@pxdev.xzpeter.org> (raw)
In-Reply-To: <20161221025337.GA13024@umbus.fritz.box>

On Wed, Dec 21, 2016 at 01:53:37PM +1100, David Gibson wrote:

[...]

> > Could you explain why the device address space here has anything to do
> > with PCI BARs? I thought BARs are for the CPU address space only (so
> > that the CPU can access PCI registers via MMIO), am I wrong?
> 
> In short, yes.  So, first think about vanilla PCI - most things are
> PCI-E these days, but the PCI addressing model which was designed for
> the old hardware is still mostly the same.
> 
> With plain PCI, you have a physical bus over which address and data
> cycles pass.  Those cycles don't distinguish between transfers from
> host to device or device to host.  Each address cycle just gives which
> address space: configuration, IO or memory, and an address.
> 
> Devices respond to addresses within their BARs, typically such cycles
> will come from the host, but they don't have to - a device is able to
> send cycles to another device (peer to peer DMA).  Meanwhile the host
> bridge will respond to addresses within certain DMA windows,
> propagating those accesses onwards to system memory.  How many DMA
> windows there are, their size, location and whether they're mapped
> directly or via an IOMMU depends on the model of host bridge.
> 
> On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped
> directly to memory addresses 0..<somewhere>, identity mapping RAM into
> PCI space.  BARs would be assigned above <somewhere>, so they don't
> collide.  I suspect old enough machines will have <somewhere> == 2G,
> leaving 2G..4G for the BARs of 32-bit devices.  More modern x86
> bridges must have provisions for accessing memory above 4G, but I'm
> not entirely certain how that works.
> 
> PAPR traditionally also had a DMA window from 0..2G, however instead
> of being direct mapped to RAM, it is always translated via an IOMMU.
> More modern PAPR systems have that window by default, but allow the
> OS to remove it and configure up to 2 DMA windows of variable length
> and page size.  Various other platforms have various other DMA window
> arrangements.
> 
> With PCI-E, of course, upstream and downstream cycles are distinct,
> and peer to peer DMA isn't usually possible (unless a switch is
> configured specially to allow it by forwarding cycles from one
> downstream port to another).  But the address model remains logically
> the same: there is just one PCI memory space and both device BARs and
> host DMA windows live within it.  Firmware and/or the OS need to know
> the details of the platform's host bridge, and configure both the BARs
> and the DMA windows so that they don't collide.

Thanks for the thorough explanation. :)

So we should mask out all the MMIO regions (including BAR address
ranges) from the PCI device address space, right? Since devices should
not be able to access such addresses, only system RAM?
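
If so, a rough sketch of what I have in mind (just an illustration,
not the actual change; "machine_ram" stands in for the machine's RAM
MemoryRegion, and vtd_dev_as->root for the per-device container):

  /* Alias only RAM into the device's address space, so that CPU MMIO
   * regions and BARs are not reachable via DMA. */
  memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
                           "vtd_sys_alias", machine_ram,
                           0, memory_region_size(machine_ram));
  memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
                                      &vtd_dev_as->sys_alias, 0);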

> 
> > I think we should have a big enough IOMMU region size here. If the
> > device writes to invalid addresses, IMHO we should trap it and report
> > to the guest. If we have a smaller size than UINT64_MAX, how can we
> > trap this behavior and report it for the whole address space (it
> > should cover [0, 2^64-1])?
> 
> That's not how the IOMMU works.  How it traps is dependent on the
> specific IOMMU model, but generally they'll only ever look at cycles
> which lie within the IOMMU's DMA window.  On x86 I'm pretty sure that
> window will be large, but it won't be 2^64.  It's also likely to have
> a gap between 2..4GiB to allow room for the BARs of 32-bit devices.

But for the x86 IOMMU region, I don't know of anything like a "DMA
window" - each device has its own context entry, which points to a
whole page table. In that sense I think at least all addresses in
[0, 2^39-1] should be legal? That range depends on how many address
space bits the specific Intel IOMMU supports; currently the emulated
VT-d supports 39 bits.

An example: with VT-d, one should be able to map the address 3G
(0xc0000000, an IOVA address here) to any physical address, as long
as the page table is set up correctly.
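
To make the size concrete, a rough sketch of how I would expect the
region to be bounded if we used the address width instead of
UINT64_MAX (illustration only; VTD_ADDRESS_WIDTH is a placeholder for
the emulated 39-bit limit):

  /* VTD_ADDRESS_WIDTH: assumed name for the 39-bit guest address width. */
  uint64_t size = 1ULL << VTD_ADDRESS_WIDTH;   /* 2^39 bytes of IOVA space */

  memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                           &s->iommu_ops, "intel_iommu", size);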

Hope I didn't miss anything important..

> 
> > > 
> > > > +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > > > +                                 "vtd_sys_alias", get_system_memory(),
> > > > +                                 0, memory_region_size(get_system_memory()));
> > > 
> > > I strongly suspect using memory_region_size(get_system_memory()) is
> > > also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> > > you can't actually access all of that via PCI space (again, it
> > > would collide with actual PCI BARs).  I also suspect you can't reach
> > > CPU MMIO regions via the PCI DMA space.
> > 
> > Hmm, sounds correct.
> > 
> > However, if so, won't we have the same problem without an IOMMU? See
> > pci_device_iommu_address_space() - address_space_memory will be the
> > default if we have no IOMMU protection, and that will cover e.g. CPU
> > MMIO regions as well.
> 
> True.  That default is basically assuming that both the host bridge's
> DMA windows, and its outbound IO and memory windows are identity
> mapped between the system bus and the PCI address space.  I suspect
> that's rarely 100% true, but it's close enough to work on a fair few
> platforms.
> 
> But since you're building a more accurate model of the x86 host
> bridge's behaviour here, you might as well try to get it as accurate
> as possible.

Yes, but even if we fix this problem, shouldn't the fix apply to the
no-IOMMU case as well? If so, I think it would be more suitable as a
separate standalone patch.
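
For reference, this is roughly the default path I am referring to
(paraphrased from memory, not the exact hw/pci/pci.c code):

  /* Without an IOMMU hook registered on the bus, the device falls back
   * to the global address_space_memory, which covers CPU MMIO too. */
  if (iommu_bus && iommu_bus->iommu_fn) {
      return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
  }
  return &address_space_memory;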

Anyway, I noted this down. Thanks,

-- peterx
