From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:43308) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Y0XZL-0001kP-P1 for qemu-devel@nongnu.org; Mon, 15 Dec 2014 10:28:25 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Y0XZC-0000iN-VV for qemu-devel@nongnu.org; Mon, 15 Dec 2014 10:28:19 -0500 Received: from mx-v6.kamp.de ([2a02:248:0:51::16]:58050 helo=mx01.kamp.de) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Y0XZC-0000hq-9p for qemu-devel@nongnu.org; Mon, 15 Dec 2014 10:28:10 -0500 Message-ID: <548EFE05.10306@kamp.de> Date: Mon, 15 Dec 2014 16:28:05 +0100 From: Peter Lieven MIME-Version: 1.0 References: <1418655903-7613-1-git-send-email-pl@kamp.de> <1418656764.1095.257.camel@bling.home> In-Reply-To: <1418656764.1095.257.camel@bling.home> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH] vfio-pci: add a switch to disable PCI AER List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: qemu-devel@nongnu.org On 15.12.2014 16:19, Alex Williamson wrote: > On Mon, 2014-12-15 at 16:05 +0100, Peter Lieven wrote: >> AER is meant to let a device driver recover from >> errors discovered on the PCIe bus. However, the >> current implementation of vfio-pci does not distingish >> between correctable or uncorrectalbe as well as fatal >> vs. non-fatal errors. Any kind of error can trigger the >> error correction interrupt and cause all vServers to >> switch to RUN_STATE_INTERNAL_ERROR. >> >> I have observed correctable non-fatal errors on a >> PCI root hub which where then propagated to all vServers >> on this root hub causing them to shut down. >> >> I added this switch to be able to ignore AER interrupts >> until a proper interface to propagate the error type from >> kernel to qemu is there. That would be the old behaviour >> of pci-assign or a host kernel not supporting AER. > I don't think it's been proven that the kernel is signaling on all the > error types claimed here. Our QE has certainly done testing on > corrected errors, the question is whether uncorrected, non-fatal errors > are getting through. If they are and if those errors should not be > signaled, I'd rather fix it in the kernel than require users to add > obscure device parameters to their VMs. Thanks, As far as I understand it is the task of the vfio-pci kernel driver to signal if it has recovered or not by the return code of the error handler. Currently it signals the VM unconditionally in that error handler. I added this switch because it fixes my and potentials others' issues. Fixing in the kernel sounds better, but in my case its easier to add a fix to qemu than compiling or obtaining a new kernel. Peter > > Alex > >> Signed-off-by: Peter Lieven >> --- >> hw/misc/vfio.c | 13 ++++++++++++- >> 1 file changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c >> index fd318a1..b479708 100644 >> --- a/hw/misc/vfio.c >> +++ b/hw/misc/vfio.c >> @@ -217,6 +217,8 @@ typedef struct VFIODevice { >> uint32_t features; >> #define VFIO_FEATURE_ENABLE_VGA_BIT 0 >> #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT) >> +#define VFIO_FEATURE_ENABLE_PCI_AER_BIT 1 >> +#define VFIO_FEATURE_ENABLE_PCI_AER (1 << VFIO_FEATURE_ENABLE_PCI_AER_BIT) >> int32_t bootindex; >> uint8_t pm_cap; >> bool reset_works; >> @@ -4025,7 +4027,13 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) >> DPRINTF("VFIO_DEVICE_GET_IRQ_INFO failure: %m\n"); >> ret = 0; >> } else if (irq_info.count == 1) { >> - vdev->pci_aer = true; >> + vdev->pci_aer = !!(vdev->features & VFIO_FEATURE_ENABLE_PCI_AER); >> + if (!vdev->pci_aer) { >> + error_report("vfio: %04x:%02x:%02x.%x " >> + "Ignoring error recovery interrupts for the device", >> + vdev->host.domain, vdev->host.bus, vdev->host.slot, >> + vdev->host.function); >> + } >> } else { >> error_report("vfio: %04x:%02x:%02x.%x " >> "Could not enable error recovery for the device", >> @@ -4381,6 +4389,9 @@ static Property vfio_pci_dev_properties[] = { >> intx.mmap_timeout, 1100), >> DEFINE_PROP_BIT("x-vga", VFIODevice, features, >> VFIO_FEATURE_ENABLE_VGA_BIT, false), >> + DEFINE_PROP_BIT("pci-aer", VFIODevice, features, >> + VFIO_FEATURE_ENABLE_PCI_AER, true), >> + >> /* >> * TODO - support passed fds... is this necessary? >> * DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name), > >