From: Alex Williamson
Subject: Re: KVM pci-assign - iommu width is not sufficient for mapped address
Date: Fri, 08 Jan 2016 11:52:57 -0700
Message-ID: <1452279177.29599.226.camel@redhat.com>
References: <1452175812.29599.132.camel@redhat.com> <1452228786.29599.188.camel@redhat.com>
To: Shyam
Cc: kvm@vger.kernel.org

On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
> Hi Alex,
>
> It will be hard to reproduce this on Fedora/RHEL. We have Ubuntu based
> server/VM & I can shift to any kernel/qemu/vfio versions that you
> recommend.
>
> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
> Linux Kernel version 3.18.19 (from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
>
> Qemu version on the host is
> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21), Copyright
> (c) 2003-2008 Fabrice Bellard
>
> We are using 8 X Intel RMS3CC080 SSD's for this test. We expose these
> SSD's to the VM (through iSER) & then setup dm-stripe over them within
> the VM. We create two dm-linear out of this at 100GB size & expose
> through SCST to an external server. External server iSER connects to
> these devices & have multipath 4Xpaths (policy: queue-length:0) per
> device. From external server we run fio with 4 threads & each with
> 64-outstanding IOs of 100% 4K random-reads.
>
> This is the performance difference we see
>
> with PCI-assign to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu tot:85.57 usr:3.96 sys:31.55 iow:50.06
>
> i.e. we get 137-140K IOPs or 550MB/s
>
> with VFIO to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu tot:78.58 usr:2.28 sys:18.00 iow:58.30
>
> i.e. we get 77-80K IOPs or 310MB/s
>
> The only change between the two runs is to have a VM that is spawned
> with VFIO instead of pci-assign. There is no other difference in
> software versions or any settings.
>
> $ grep VFIO /boot/config-`uname -r`
> CONFIG_VFIO_IOMMU_TYPE1=m
> CONFIG_VFIO=m
> CONFIG_VFIO_PCI=m
> CONFIG_VFIO_PCI_VGA=y
> CONFIG_KVM_VFIO=y
>
> I uploaded QEMU command-line & lspci outputs at
> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
>
> Pls let me know if you have any issues in downloading it.
>
> Please let us know if you see any KVM acceleration is disabled &
> suggested next steps to debug with VFIO tracing. Thanks for your
> help!

Thanks for the logs, everything appears to be set up correctly. One
suspicion I have is the difference between pci-assign and vfio-pci in
the way the MSI-X Pending Bits Array (PBA) is handled. Legacy KVM
device assignment handles MSI-X itself and ignores the PBA. On this
hardware the MSI-X vector table and PBA are nicely aligned on separate
4k pages, which means that pci-assign will give the VM direct access to
everything on the PBA page. On the other hand, vfio-pci registers MSI-X
with QEMU, which does handle the PBA. The vast majority of drivers
never use the PBA and the PCI spec includes an implementation note
suggesting that hardware vendors include additional alignment to
prevent MSI-X structures from overlapping with other registers. My
hypothesis is that this device perhaps does not abide by that
recommendation and may be regularly accessing the PBA page, thus
causing a vfio-pci assigned device to trap through to QEMU more
regularly than a legacy assigned device.
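
To make the alignment point concrete, here is a minimal standalone C sketch (not QEMU code) of how the table and PBA locations fall out of the MSI-X capability registers and how to tell whether they land on separate 4k pages. It assumes only the capability layout from the PCI spec: Message Control bits 10:0 hold the table size minus one, and the Table and PBA registers each carry a BAR index in bits 2:0 with the offset in the remaining bits. The struct and function names are hypothetical, and the sketch cannot tell whether other device registers share the PBA page, since that is device specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K    4096u
#define MSIX_ENTRY_SIZE 16u   /* bytes per MSI-X table entry */

/* Decoded view of the MSI-X capability registers (hypothetical helper type). */
struct msix_layout {
    unsigned nvectors;      /* Message Control bits 10:0, plus one */
    unsigned table_bir;     /* BAR index holding the vector table */
    uint32_t table_offset;  /* offset of the table within that BAR */
    unsigned pba_bir;       /* BAR index holding the PBA */
    uint32_t pba_offset;    /* offset of the PBA within that BAR */
};

/* Split the raw register values into BAR index and offset fields. */
static struct msix_layout msix_decode(uint16_t msg_ctrl,
                                      uint32_t table_reg, uint32_t pba_reg)
{
    struct msix_layout l;

    l.nvectors = (msg_ctrl & 0x7ffu) + 1;
    l.table_bir = table_reg & 0x7u;
    l.table_offset = table_reg & ~0x7u;
    l.pba_bir = pba_reg & 0x7u;
    l.pba_offset = pba_reg & ~0x7u;
    return l;
}

/* PBA size in bytes: one pending bit per vector, rounded up to 64-bit QWORDs. */
static uint32_t msix_pba_size(unsigned nvectors)
{
    assert(nvectors >= 1 && nvectors <= 2048);
    return ((nvectors + 63u) / 64u) * 8u;
}

static uint32_t page_of(uint32_t offset)
{
    return offset / PAGE_SIZE_4K;
}

/* True if the table and PBA live in the same BAR and touch a common 4k page. */
static bool msix_table_pba_share_page(const struct msix_layout *l)
{
    if (l->table_bir != l->pba_bir) {
        return false;
    }

    uint32_t t_first = page_of(l->table_offset);
    uint32_t t_last = page_of(l->table_offset +
                              l->nvectors * MSIX_ENTRY_SIZE - 1);
    uint32_t p_first = page_of(l->pba_offset);
    uint32_t p_last = page_of(l->pba_offset +
                              msix_pba_size(l->nvectors) - 1);

    return t_first <= p_last && p_first <= t_last;
}
```

With made-up values (not taken from the lspci output in this thread), a 16-vector device with the table at BAR offset 0x2000 and the PBA at 0x3000 reports no shared page, while moving the PBA to 0x2800 in the same BAR reports a shared one.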

If I could ask you to build and run a new QEMU, I think we can easily
test this hypothesis by making vfio-pci behave more like pci-assign.
The following patch is based on QEMU 2.5 and simply skips the step of
placing the PBA memory region overlapping the device, allowing direct
access in this case. The patch is easily adaptable to older versions of
QEMU, but if we need to do any further tracing, it's probably best to
do so on 2.5 anyway. This is only a proof of concept; if it proves to
be the culprit we'll need to think about how to handle it more cleanly.
Here's the patch:

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 64c93d8..a5ad18c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
     memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
     memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
                           "msix-pba", pba_size);
-    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
+    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
 
     return 0;
 }
@@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
     dev->msix_cap = 0;
     msix_free_irq_entries(dev);
     dev->msix_entries_nr = 0;
-    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
+    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
     g_free(dev->msix_pba);
     dev->msix_pba = NULL;
     memory_region_del_subregion(table_bar, &dev->msix_table_mmio);

Thanks,
Alex