From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alex Williamson Subject: Re: KVM pci-assign - iommu width is not sufficient for mapped address Date: Mon, 11 Jan 2016 16:20:23 -0700 Message-ID: <1452554423.29599.333.camel@redhat.com> References: <1452175812.29599.132.camel@redhat.com> <1452228786.29599.188.camel@redhat.com> <1452279177.29599.226.camel@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: kvm@vger.kernel.org To: Shyam Return-path: Received: from mx1.redhat.com ([209.132.183.28]:55919 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760838AbcAKXUY (ORCPT ); Mon, 11 Jan 2016 18:20:24 -0500 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On Mon, 2016-01-11 at 15:41 +0530, Shyam wrote: > Hi Alex, >=20 > You are spot on! >=20 > Applying your patch on QEMU 2.5.50 (latest from github master) solves > the performance issue fully. We are able to get back to pci-assign > performance numbers. Great! >=20 > Can you please see how to formalize this patch cleanly? I will be > happy to test additional patches for you. Thanks a lot for your help! Hi Shyam, Thanks for the testing. =C2=A0I'm really tempted to just disable PBA emulation altogether, but I came up with the below patch which enables it only in the off chance that it's needed. =C2=A0Patch is against curr= ent qemu.git, please test. =C2=A0Thanks! Alex commit 4f97c12c9f801fabdd3405758408f516e8ea1a80 Author: Alex Williamson Date:=C2=A0=C2=A0=C2=A0Mon Jan 11 10:44:13 2016 -0700 =C2=A0=C2=A0=C2=A0=C2=A0vfio/pci: Lazy PBA emulation =C2=A0=C2=A0=C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0The PCI spec recommends devices use additional = alignment for MSI-X =C2=A0=C2=A0=C2=A0=C2=A0data structures to allow software to map them t= o separate processor =C2=A0=C2=A0=C2=A0=C2=A0pages.=C2=A0=C2=A0One advantage of doing this i= s that we can emulate those data =C2=A0=C2=A0=C2=A0=C2=A0structures without a significant performance im= pact to the operation =C2=A0=C2=A0=C2=A0=C2=A0of the device.=C2=A0=C2=A0Some devices fail to = implement that suggestion and =C2=A0=C2=A0=C2=A0=C2=A0assigned device performance suffers. =C2=A0=C2=A0=C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0One such case of this is a Mellanox MT27500 ser= ies, ConnectX-3 VF, =C2=A0=C2=A0=C2=A0=C2=A0where the MSI-X vector table and PBA are aligne= d on separate 4K =C2=A0=C2=A0=C2=A0=C2=A0pages.=C2=A0=C2=A0If PBA emulation is enabled, = performance suffers.=C2=A0=C2=A0It's not =C2=A0=C2=A0=C2=A0=C2=A0clear how much value we get from PBA emulation,= but the solution here =C2=A0=C2=A0=C2=A0=C2=A0is to only lazily enable the emulated PBA when = a masked MSI-X vector =C2=A0=C2=A0=C2=A0=C2=A0fires.=C2=A0=C2=A0We then attempt to more aggre= sively disable the PBA memory =C2=A0=C2=A0=C2=A0=C2=A0region any time a vector is unmasked.=C2=A0=C2=A0= The expectation is then that =C2=A0=C2=A0=C2=A0=C2=A0a typical VM will run entirely with PBA emulati= on disabled, and only =C2=A0=C2=A0=C2=A0=C2=A0when used is that emulation re-enabled. =C2=A0=C2=A0=C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0Reported-by: Shyam Kaushik =C2=A0=C2=A0=C2=A0=C2=A0Signed-off-by: Alex Williamson diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 1fb868c..e66c47f 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -356,6 +356,13 @@ static void vfio_msi_interrupt(void *opaque) =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0if (vdev->interrupt =3D=3D VFIO_INT_MSIX)= { =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0get_msg =3D msix_= get_message; =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0notify =3D msix_n= otify; + +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0/* A masked vector fir= ing needs to use the PBA, enable it */ +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0if (msix_is_masked(&vd= ev->pdev, nr)) { +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= set_bit(nr, vdev->msix->pending); +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, true); +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= trace_vfio_msix_pba_enable(vdev->vbasedev.name); +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0} =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0} else if (vdev->interrupt =3D=3D VFIO_IN= T_MSI) { =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0get_msg =3D msi_g= et_message; =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0notify =3D msi_no= tify; @@ -535,6 +542,14 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev= , unsigned int nr, =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0} =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0} =C2=A0 +=C2=A0=C2=A0=C2=A0=C2=A0/* Disable PBA emulation when nothing more is = pending. */ +=C2=A0=C2=A0=C2=A0=C2=A0clear_bit(nr, vdev->msix->pending); +=C2=A0=C2=A0=C2=A0=C2=A0if (find_first_bit(vdev->msix->pending, +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vdev-= >nr_vectors) =3D=3D vdev->nr_vectors) { +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0memory_region_set_enab= led(&vdev->pdev.msix_pba_mmio, false); +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0trace_vfio_msix_pba_di= sable(vdev->vbasedev.name); +=C2=A0=C2=A0=C2=A0=C2=A0} + =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0return 0; =C2=A0} =C2=A0 @@ -738,6 +753,9 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev) =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vfio_msi_disable_common(vdev); =C2=A0 +=C2=A0=C2=A0=C2=A0=C2=A0memset(vdev->msix->pending, 0, +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0BITS= _TO_LONGS(vdev->msix->entries) * sizeof(unsigned long)); + =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0trace_vfio_msix_disable(vdev->vbasedev.na= me); =C2=A0} =C2=A0 @@ -1251,6 +1269,8 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, i= nt pos) =C2=A0{ =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0int ret; =C2=A0 +=C2=A0=C2=A0=C2=A0=C2=A0vdev->msix->pending =3D g_malloc0(BITS_TO_LONG= S(vdev->msix->entries) * +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= sizeof(unsigned long)); =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0ret =3D msix_init(&vdev->pdev, vdev->msix= ->entries, =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0&vdev->bars[vdev-= >msix->table_bar].region.mem, =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0vdev->msix->table= _bar, vdev->msix->table_offset, @@ -1264,6 +1284,24 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, = int pos) =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0return ret; =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0} =C2=A0 +=C2=A0=C2=A0=C2=A0=C2=A0/* +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* The PCI spec suggests that devices pro= vide additional alignment for +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* MSI-X structures and avoid overlapping= non-MSI-X related registers. +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* For an assigned device, this hopefully= means that emulation of MSI-X +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* structures does not affect the perform= ance of the device.=C2=A0=C2=A0If devices +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* fail to provide that alignment, a sign= ificant performance penalty may +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* result, for instance Mellanox MT27500 = VFs: +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* http://www.spinics.net/lists/kvm/msg12= 5881.html +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* The PBA is simply not that important f= or such a serious regression and +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* most drivers do not appear to look at = it.=C2=A0=C2=A0The solution for this is to +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* disable the PBA MemoryRegion unless it= 's being used.=C2=A0=C2=A0We disable it +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* here and only enable it if a masked ve= ctor fires through QEMU.=C2=A0=C2=A0As the +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* vector-use notifier is called, which o= ccurs on unmask, we test whether +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* PBA emulation is needed and again disa= ble if not. +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/ +=C2=A0=C2=A0=C2=A0=C2=A0memory_region_set_enabled(&vdev->pdev.msix_pba= _mmio, false); + =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0return 0; =C2=A0} =C2=A0 @@ -1275,6 +1313,7 @@ static void vfio_teardown_msi(VFIOPCIDevice *vdev= ) =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0msix_uninit(&vdev= ->pdev, =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0&vdev->bars[vdev-= >msix->table_bar].region.mem, =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0&vdev->bars[vdev-= >msix->pba_bar].region.mem); +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0g_free(vdev->msix->pen= ding); =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0} =C2=A0} =C2=A0 diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index f004d52..6256587 100644 --- a/hw/vfio/pci.h +++ b/hw/vfio/pci.h @@ -95,6 +95,7 @@ typedef struct VFIOMSIXInfo { =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0uint32_t pba_offset; =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0MemoryRegion mmap_mem; =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0void *mmap; +=C2=A0=C2=A0=C2=A0=C2=A0unsigned long *pending; =C2=A0} VFIOMSIXInfo; =C2=A0 =C2=A0typedef struct VFIOPCIDevice { diff --git a/trace-events b/trace-events index 934a7b6..c9ac144 100644 --- a/trace-events +++ b/trace-events @@ -1631,6 +1631,8 @@ vfio_msi_interrupt(const char *name, int index, u= int64_t addr, int data) " (%s) =C2=A0vfio_msix_vector_do_use(const char *name, int index) " (%s) vecto= r %d used" =C2=A0vfio_msix_vector_release(const char *name, int index) " (%s) vect= or %d released" =C2=A0vfio_msix_enable(const char *name) " (%s)" +vfio_msix_pba_disable(const char *name) " (%s)" +vfio_msix_pba_enable(const char *name) " (%s)" =C2=A0vfio_msix_disable(const char *name) " (%s)" =C2=A0vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled = %d MSI vectors" =C2=A0vfio_msi_disable(const char *name) " (%s)"