Date: Wed, 21 Dec 2016 13:53:37 +1100
From: David Gibson
To: Peter Xu
Cc: qemu-devel@nongnu.org, tianyu.lan@intel.com, kevin.tian@intel.com,
 mst@redhat.com, jan.kiszka@siemens.com, jasowang@redhat.com,
 alex.williamson@redhat.com, bd.aviv@gmail.com
Subject: Re: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
Message-ID: <20161221025337.GA13024@umbus.fritz.box>
In-Reply-To: <20161220041650.GA22006@pxdev.xzpeter.org>
References: <1482158486-18597-1-git-send-email-peterx@redhat.com>
 <20161219233012.GF23176@umbus.fritz.box>
 <20161220041650.GA22006@pxdev.xzpeter.org>

On Tue, Dec 20, 2016 at 12:16:50PM +0800, Peter Xu wrote:
> On Tue, Dec 20, 2016 at 10:30:12AM +1100, David Gibson wrote:
> 
> [...]
> 
> > > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
> > > +{
> > > +    GHashTableIter iter;
> > > +    VTDBus *vtd_bus;
> > > +    VTDAddressSpace *as;
> > > +    int i;
> > > +
> > > +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > > +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> > > +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> > > +            as = vtd_bus->dev_as[i];
> > > +            if (as == NULL) {
> > > +                continue;
> > > +            }
> > > +            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
> > > +                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
> > > +                                           enabled);
> > > +            if (enabled) {
> > > +                memory_region_add_subregion_overlap(&as->root, 0,
> > > +                                                    &as->iommu, 2);
> > > +            } else {
> > > +                memory_region_del_subregion(&as->root, &as->iommu);
> > 
> > Why not use memory_region_set_enabled() rather than actually
> > adding/deleting the subregion?
> 
> Good idea, thanks. :-)
> 
> [...]
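For what it's worth, here's roughly what I had in mind - an untested
sketch, assuming the &as->iommu subregion gets added to &as->root once
at init time and is then left in place permanently:

static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
{
    GHashTableIter iter;
    VTDBus *vtd_bus;
    VTDAddressSpace *as;
    int i;

    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
            as = vtd_bus->dev_as[i];
            if (as == NULL) {
                continue;
            }
            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
                                           VTD_PCI_SLOT(i),
                                           VTD_PCI_FUNC(i), enabled);
            /* Toggle visibility in place: a disabled region simply
             * drops out of the flattened view, so there's no need to
             * add/del the subregion on every switch. */
            memory_region_set_enabled(&as->iommu, enabled);
        }
    }
}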
> > > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > >      vtd_dev_as->devfn = (uint8_t)devfn;
> > >      vtd_dev_as->iommu_state = s;
> > >      vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> > > +
> > > +    /*
> > > +     * When DMAR is disabled, the memory region relationships look
> > > +     * like this:
> > > +     *
> > > +     *   0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> > > +     *    0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> > > +     *    00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> > > +     *
> > > +     * When DMAR is enabled, it becomes:
> > > +     *
> > > +     *   0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
> > > +     *    0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
> > > +     *    0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
> > > +     *    00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
> > > +     *
> > > +     * The intel_iommu region is dynamically added/removed.
> > > +     */
> > >      memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
> > >                               &s->iommu_ops, "intel_iommu", UINT64_MAX);
> > 
> > I'm almost certain UINT64_MAX is wrong here.  For one thing it would
> > collide with PCI BARs.  For another, I can't imagine that the IOMMU
> > page tables can really span an entire 2^64 space.
> 
> Could you explain why the device address space has anything to do with
> PCI BARs? I thought BARs were for the CPU address space only (so that
> the CPU can access PCI registers in an MMIO manner) - am I wrong?

In short, yes.

So, first think about vanilla PCI - most things are PCI-E these days,
but the PCI addressing model, which was designed for the old hardware,
is still mostly the same.

With plain PCI, you have a physical bus over which address and data
cycles pass.  Those cycles don't distinguish between transfers from
host to device or device to host.  Each address cycle just gives the
target address space - configuration, IO or memory - and an address.
Devices respond to addresses within their BARs; typically such cycles
will come from the host, but they don't have to - a device is able to
send cycles to another device (peer-to-peer DMA).  Meanwhile the host
bridge will respond to addresses within certain DMA windows,
propagating those accesses onwards to system memory.  How many DMA
windows there are, their size and location, and whether they're mapped
directly or via an IOMMU all depend on the model of host bridge.

On x86, traditionally, PCI addresses 0..<top of RAM> were simply
mapped directly to memory addresses 0..<top of RAM>, identity-mapping
RAM into PCI space.  BARs would be assigned above <top of RAM>, so
they don't collide.  I suspect old enough machines will have
<top of RAM> == 2G, leaving 2G..4G for the BARs of 32-bit devices.
More modern x86 bridges must have provisions for accessing memory
above 4G, but I'm not entirely certain how that works.

PAPR traditionally also had a DMA window from 0..2G; however, instead
of being direct-mapped to RAM, it is always translated via an IOMMU.
More modern PAPR systems have that window by default, but allow the OS
to remove it and configure up to 2 DMA windows of variable length and
page size.  Various other platforms have various other DMA window
arrangements.

With PCI-E, of course, upstream and downstream cycles are distinct,
and peer-to-peer DMA isn't usually possible (unless a switch is
configured specially to allow it, by forwarding cycles from one
downstream port to another).  But the address model remains logically
the same: there is just one PCI memory space, and both device BARs and
host DMA windows live within it.  Firmware and/or the OS need to know
the details of the platform's host bridge, and configure both the BARs
and the DMA windows so that they don't collide.
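If it helps to make the window idea concrete, here's a toy model of
how a bridge decides whether to claim a DMA cycle - purely
illustrative, all names and numbers invented for the example:

#include <stdbool.h>
#include <stdint.h>

/* One DMA window: a chunk of PCI space that the host bridge claims
 * and forwards towards system memory (directly, or via an IOMMU). */
typedef struct DMAWindow {
    uint64_t pci_base;  /* start of window, in PCI memory space */
    uint64_t size;      /* length of window */
    uint64_t mem_base;  /* where it lands on the system bus side */
} DMAWindow;

static bool window_claims(const DMAWindow *w, uint64_t pci_addr,
                          uint64_t *mem_addr)
{
    if (pci_addr < w->pci_base || pci_addr - w->pci_base >= w->size) {
        return false;   /* bridge ignores the cycle entirely */
    }
    *mem_addr = w->mem_base + (pci_addr - w->pci_base);
    return true;
}

/* e.g. the traditional x86 arrangement described above:
 *     DMAWindow w = { .pci_base = 0, .size = 2ULL << 30, .mem_base = 0 };
 * A DMA to 0x80000000 (2G) falls outside the window and is simply
 * never seen by the bridge - it doesn't trap, it just goes unclaimed. */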
> I think we should have a big enough IOMMU region size here.  If a
> device writes to an invalid address, IMHO we should trap it and
> report it to the guest.  If we have a smaller size than UINT64_MAX,
> how can we trap this behaviour and report it for the whole address
> space (it should cover [0, 2^64-1])?

That's not how the IOMMU works.  How it traps is dependent on the
specific IOMMU model, but generally they'll only even look at cycles
which lie within the IOMMU's DMA window.  On x86 I'm pretty sure that
window will be large, but it won't be 2^64.  It's also likely to have
a gap between 2GiB and 4GiB to allow room for the BARs of 32-bit
devices.

> > > +    memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > > +                             "vtd_sys_alias", get_system_memory(),
> > > +                             0, memory_region_size(get_system_memory()));
> > 
> > I strongly suspect using memory_region_size(get_system_memory()) is
> > also incorrect here.  System memory has size UINT64_MAX, but I'll
> > bet you can't actually access all of that via PCI space (again, it
> > would collide with actual PCI BARs).  I also suspect you can't reach
> > CPU MMIO regions via the PCI DMA space.
> 
> Hmm, sounds correct.
> 
> However, if so, won't we have the same problem even without an IOMMU?
> See pci_device_iommu_address_space() - address_space_memory will be
> the default if we have no IOMMU protection, and that will cover
> e.g. CPU MMIO regions as well.

True.  That default is basically assuming that both the host bridge's
DMA windows, and its outbound IO and memory windows, are identity
mapped between the system bus and the PCI address space.  I suspect
that's rarely 100% true, but it's close enough to work on a fair few
platforms.  But since you're building a more accurate model of the x86
host bridge's behaviour here, you might as well try to get it as
accurate as possible.

> > So, I think you should find out what this limit actually is and
> > restrict the alias to that window.
> 
> /me needs some more reading to figure this out.  I'm still not quite
> familiar with the whole VM memory region business.  Hints are
> welcome...

It's not really a question of VM memory regions.  Rather, it's about
how the host bridge translates between cpu-side and PCI-side
addresses.
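As a starting point, maybe something like the following - very much a
sketch, since I don't know off the top of my head what the right bound
for the x86 bridge is.  VTD_DMA_WINDOW_SIZE is a made-up placeholder
name and value; on real VT-d hardware the supported guest address
width is advertised via the MGAW field of the capability register,
which might be the right thing to derive it from:

/* Placeholder: the bridge's actual DMA window size, NOT 2^64. */
#define VTD_DMA_WINDOW_SIZE  (1ULL << 48)  /* invented value */

memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                         &s->iommu_ops, "intel_iommu",
                         VTD_DMA_WINDOW_SIZE);
memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
                         "vtd_sys_alias", get_system_memory(),
                         0, VTD_DMA_WINDOW_SIZE);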
> > >      memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
> > >                            &vtd_mem_ir_ops, s, "intel_iommu_ir",
> > >                            VTD_INTERRUPT_ADDR_SIZE);
> > > -    memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> > > -                                &vtd_dev_as->iommu_ir);
> > > -    address_space_init(&vtd_dev_as->as,
> > > -                       &vtd_dev_as->iommu, "intel_iommu");
> > > +    memory_region_init(&vtd_dev_as->root, OBJECT(s),
> > > +                       "vtd_root", UINT64_MAX);
> > > +    memory_region_add_subregion_overlap(&vtd_dev_as->root,
> > > +                                        VTD_INTERRUPT_ADDR_FIRST,
> > > +                                        &vtd_dev_as->iommu_ir, 64);
> > > +    address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> > > +    memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> > > +                                        &vtd_dev_as->sys_alias, 1);
> > > +    if (s->dmar_enabled) {
> > > +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> > > +                                            &vtd_dev_as->iommu, 2);
> > > +    }
> > 
> > Hmm.  You have the IOMMU translated region overlaying the
> > direct-mapped alias.  You enable and disable the IOMMU subregion,
> > but you always leave the direct mapping enabled.  You might get away
> > with this because the size of the IOMMU region is UINT64_MAX, which
> > will overlay everything - but as above, I think that's wrong.  If
> > that's changed, then guest devices may be able to access portions of
> > the raw address space outside the IOMMU-mapped region, which could
> > break the guest's expectations of device isolation.
> > 
> > I think it would be much safer to disable the system memory alias
> > when the IOMMU is enabled.
> 
> Reasonable.  Will adopt.
> 
> Thanks!
> 
> -- peterx

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson