From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Williamson <alex.williamson@redhat.com>
Subject: Re: KVM pci-assign - iommu width is not sufficient for mapped
 address
Date: Fri, 08 Jan 2016 11:52:57 -0700
Message-ID: <1452279177.29599.226.camel@redhat.com>
References: <CABYsEFdjH+mY7+a2VvpF4S=jF_XPac6iDYFWr7WDNHYKZB_mpw@mail.gmail.com>
	 <1452175812.29599.132.camel@redhat.com>
	 <CABYsEFcHodzaCPGy4K=bBtpNs+keS0NfjM1HBCz1-91vtKduig@mail.gmail.com>
	 <1452228786.29599.188.camel@redhat.com>
	 <CABYsEFfByf=F1gCF49ESo1pZ1kJKKidZe+Sx+PmB4UciBa=Mfw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: kvm@vger.kernel.org
To: Shyam <shyam.kaushik@gmail.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:56768 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1755456AbcAHSw6 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 8 Jan 2016 13:52:58 -0500
In-Reply-To: <CABYsEFfByf=F1gCF49ESo1pZ1kJKKidZe+Sx+PmB4UciBa=Mfw@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
> Hi Alex,
>
> It will be hard to reproduce this on Fedora/RHEL. We have Ubuntu based
> server/VM & I can shift to any kernel/qemu/vfio versions that you
> recommend.
>
> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
> Linux Kernel version 3.18.19 (from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
>
> Qemu version on the host is
> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21), Copyright
> (c) 2003-2008 Fabrice Bellard
>
> We are using 8 X Intel RMS3CC080 SSD's for this test. We expose these
> SSD's to the VM (through iSER) & then setup dm-stripe over them within
> the VM. We create two dm-linear out of this at 100GB size & expose
> through SCST to an external server. External server iSER connects to
> these devices & have multipath 4Xpaths (policy: queue-length:0) per
> device. From external server we run fio with 4 threads & each with
> 64-outstanding IOs of 100% 4K random-reads.
>
> This is the performance difference we see
>
> with PCI-assign to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu tot:85.57 usr:3.96 sys:31.55 iow:50.06
>
> i.e. we get 137-140K IOPs or 550MB/s
>
> with VFIO to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu tot:78.58 usr:2.28 sys:18.00 iow:58.30
>
> i.e. we get 77-80K IOPs or 310MB/s
>
> The only change between the two runs is to have a VM that is spawned
> with VFIO instead of pci-assign. There is no other difference in
> software versions or any settings.
>
> $ grep VFIO /boot/config-`uname -r`
> CONFIG_VFIO_IOMMU_TYPE1=m
> CONFIG_VFIO=m
> CONFIG_VFIO_PCI=m
> CONFIG_VFIO_PCI_VGA=y
> CONFIG_KVM_VFIO=y
>
> I uploaded QEMU command-line & lspci outputs at
> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
>
> Pls let me know if you have any issues in downloading it.
>
> Please let us know if you see any KVM acceleration is disabled &
> suggested next steps to debug with VFIO tracing. Thanks for your
> help!

Thanks for the logs; everything appears to be set up correctly.  One
suspicion I have is the difference between pci-assign and vfio-pci in
the way the MSI-X Pending Bits Array (PBA) is handled.  Legacy KVM
device assignment handles MSI-X itself and ignores the PBA.  On this
hardware the MSI-X vector table and PBA are nicely aligned on separate
4k pages, which means that pci-assign will give the VM direct access to
everything on the PBA page.  On the other hand, vfio-pci registers
MSI-X with QEMU, which does handle the PBA.  The vast majority of
drivers never use the PBA, and the PCI spec includes an implementation
note suggesting that hardware vendors include additional alignment to
prevent MSI-X structures from overlapping with other registers.  My
hypothesis is that this device perhaps does not abide by that
recommendation and may be regularly accessing the PBA page, thus
causing a vfio-pci assigned device to trap through to QEMU more
regularly than a legacy assigned device.
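
To make the layout question concrete, here is a minimal sketch in C of
how the table and PBA placement can be read out of the MSI-X capability
registers and checked against 4k page boundaries.  The register layout
(BIR in bits 2:0, 16-byte table entries, one pending bit per vector) is
from the PCI spec; the struct and function names are made up for
illustration and are not QEMU or kernel code.  The access_traps()
helper is a simplified model that assumes trapping happens at page
granularity, which is the assumption behind the hypothesis above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K    4096u
#define MSIX_BIR_MASK   0x7u   /* bits 2:0 of the Offset/BIR registers select the BAR */
#define MSIX_ENTRY_SIZE 16u    /* bytes per MSI-X vector table entry */

/* Hypothetical helper type: decoded placement of the MSI-X structures. */
struct msix_layout {
    unsigned table_bar, pba_bar;   /* BAR index holding each structure */
    uint32_t table_off, pba_off;   /* byte offset within that BAR */
    uint32_t table_size, pba_size; /* size in bytes */
};

/* Decode Message Control, Table Offset/BIR and PBA Offset/BIR. */
static struct msix_layout msix_decode(uint16_t msg_ctrl, uint32_t table_reg,
                                      uint32_t pba_reg)
{
    unsigned nvec = (msg_ctrl & 0x7ffu) + 1;   /* Table Size field is N-1 */
    struct msix_layout l = {
        .table_bar  = table_reg & MSIX_BIR_MASK,
        .pba_bar    = pba_reg & MSIX_BIR_MASK,
        .table_off  = table_reg & ~MSIX_BIR_MASK,
        .pba_off    = pba_reg & ~MSIX_BIR_MASK,
        .table_size = nvec * MSIX_ENTRY_SIZE,
        .pba_size   = ((nvec + 63) / 64) * 8,  /* one bit per vector, in qwords */
    };
    return l;
}

static uint32_t page_of(uint32_t off)
{
    return off / PAGE_SIZE_4K;
}

/* True if the vector table and the PBA never share a 4k page. */
static bool msix_separate_pages(const struct msix_layout *l)
{
    if (l->table_bar != l->pba_bar)
        return true;
    uint32_t t_first = page_of(l->table_off);
    uint32_t t_last  = page_of(l->table_off + l->table_size - 1);
    uint32_t p_first = page_of(l->pba_off);
    uint32_t p_last  = page_of(l->pba_off + l->pba_size - 1);
    return t_last < p_first || p_last < t_first;
}

/*
 * Simplified model (an assumption, not QEMU's actual dispatch logic):
 * if the PBA is emulated, any access landing on a page that contains
 * part of the PBA has to trap, even if it targets an unrelated register.
 */
static bool access_traps(const struct msix_layout *l, unsigned bar,
                         uint32_t off)
{
    return bar == l->pba_bar &&
           page_of(off) >= page_of(l->pba_off) &&
           page_of(off) <= page_of(l->pba_off + l->pba_size - 1);
}
```

With a 16-vector device whose table sits at BAR 0 offset 0x2000 and
whose PBA sits at offset 0x3000 (made-up values), the PBA itself is
only 8 bytes, yet under this model an access anywhere in 0x3000-0x3fff
would trap once the PBA is emulated.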

If I could ask you to build and run a new QEMU, I think we can easily
test this hypothesis by making vfio-pci behave more like pci-assign.
The following patch is based on QEMU 2.5 and simply skips the step of
placing the PBA memory region overlapping the device, allowing direct
access in this case.  The patch is easily adaptable to older versions
of QEMU, but if we need to do any further tracing, it's probably best
to do so on 2.5 anyway.  This is only a proof of concept; if it proves
to be the culprit, we'll need to think about how to handle it more
cleanly.  Here's the patch:

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 64c93d8..a5ad18c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
     memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
     memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
                           "msix-pba", pba_size);
-    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
+    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
 
     return 0;
 }
@@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
     dev->msix_cap = 0;
     msix_free_irq_entries(dev);
     dev->msix_entries_nr = 0;
-    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
+    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
     g_free(dev->msix_pba);
     dev->msix_pba = NULL;
     memory_region_del_subregion(table_bar, &dev->msix_table_mmio);

Thanks,
Alex