Message-ID: <1426477927.3643.160.camel@redhat.com>
From: Alex Williamson
Date: Sun, 15 Mar 2015 21:52:07 -0600
In-Reply-To: <55064870.6040209@cn.fujitsu.com>
References: <3c81eaae84d6b1fa6e229e765a534fdf180e1ce4.1426155432.git.chen.fan.fnst@cn.fujitsu.com> <1426286084.3643.144.camel@redhat.com> <55064870.6040209@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [PATCH v5 5/7] vfio-pci: pass the aer error to guest
To: Chen Fan
Cc: izumi.taku@jp.fujitsu.com, qemu-devel@nongnu.org

On Mon, 2015-03-16 at 11:05 +0800, Chen Fan wrote:
> On 03/14/2015 06:34 AM, Alex Williamson wrote:
> > On Thu, 2015-03-12 at 18:23 +0800, Chen Fan wrote:
> >> when the vfio device encounters an uncorrectable error in host,
> >> the vfio_pci driver will signal the eventfd registered by this
> >> vfio device, the results in the qemu eventfd handler getting
> >> invoked.
> >>
> >> this patch is to pass the error to guest and have the guest driver
> >> recover from the error.
> > What is going to be the typical recovery mechanism for the guest?
> > I'm concerned that the topology of the device in the guest doesn't
> > necessarily match the topology of the device in the host, so if the
> > guest were to attempt a bus reset to recover a device, for instance,
> > what happens?
> the recovery mechanism is that when guest got an aer error from a device,
> guest will clean the corresponding status bit in device register. and for
> need reset device, the guest aer driver would reset all devices under bus.

Sorry, I'm still confused: how does the guest aer driver reset all
devices under a bus?  Are we talking about function-level,
device-specific reset mechanisms or secondary bus resets?  If the guest
is performing secondary bus resets, what guarantee does it have that
this will translate to a physical secondary bus reset?  vfio may only do
an FLR when the bus is reset, or it may not be able to do anything,
depending on the available function-level resets and the physical and
virtual topology of the device.  Thanks,

Alex

> >> Signed-off-by: Chen Fan
> >> ---
> >>  hw/vfio/pci.c | 34 ++++++++++++++++++++++++++++------
> >>  1 file changed, 28 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 0a515b6..8966c49 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -3240,18 +3240,40 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
> >>  static void vfio_err_notifier_handler(void *opaque)
> >>  {
> >>      VFIOPCIDevice *vdev = opaque;
> >> +    PCIDevice *dev = &vdev->pdev;
> >> +    PCIEAERMsg msg = {
> >> +        .severity = 0,
> >> +        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
> >> +    };
> >>
> >>      if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
> >>          return;
> >>      }
> >>
> >> +    /* we should read the error details from the real hardware
> >> +     * configuration spaces, here we only need to do is signaling
> >> +     * to guest an uncorrectable error has occurred.
> >> +     */
> > Inconsistent comment style
> >
> >> +    if(dev->exp.aer_cap) {
> >         ^ space
> >
> >> +        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
> >> +        uint32_t uncor_status;
> >> +        bool isfatal;
> >> +
> >> +        uncor_status = vfio_pci_read_config(dev,
> >> +                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
> >> +
> >> +        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
> >> +
> >> +        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> >> +                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
> >> +
> >> +        pcie_aer_msg(dev, &msg);
> >> +        return;
> >> +    }
> >> +
> >>      /*
> >> -     * TBD. Retrieve the error details and decide what action
> >> -     * needs to be taken. One of the actions could be to pass
> >> -     * the error to the guest and have the guest driver recover
> >> -     * from the error. This requires that PCIe capabilities be
> >> -     * exposed to the guest. For now, we just terminate the
> >> -     * guest to contain the error.
> >> +     * If the aer capability is not exposed to the guest. we just
> >> +     * terminate the guest to contain the error.
> >>      */
> >>
> >>      error_report("%s(%04x:%02x:%02x.%x) Unrecoverable error detected.  "
> >
> >
> > .
> >
>
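
For readers following the quoted hunk, the two computations it relies on
(packing the source ID from bus and devfn, and deciding fatal vs. non-fatal
from the uncorrectable status and severity registers) can be sketched as
standalone C.  This is only an illustration under stated assumptions: the
sketch_* names and the enum are hypothetical and are not QEMU APIs; the
patch itself stores PCI_ERR_ROOT_CMD_FATAL_EN or PCI_ERR_ROOT_CMD_NONFATAL_EN
in msg.severity and reads the registers through vfio_pci_read_config() and
pci_get_long().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical severity codes used only by this sketch.  The quoted patch
 * uses the PCI_ERR_ROOT_CMD_*_EN constants instead.
 */
enum sketch_severity {
    SKETCH_SEV_NONFATAL,
    SKETCH_SEV_FATAL,
};

/* Pack a device number (0..31) and function number (0..7) into devfn. */
static inline uint8_t sketch_devfn(unsigned dev, unsigned fn)
{
    assert(dev < 32 && fn < 8);
    return (uint8_t)((dev << 3) | fn);
}

/* Source (requester) ID: bus number in bits 15:8, devfn in bits 7:0. */
static inline uint16_t sketch_source_id(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}

/*
 * An uncorrectable error is treated as fatal when any bit set in the
 * status register is also set in the severity register, mirroring the
 * "uncor_status & severity" test in the hunk.
 */
static inline enum sketch_severity
sketch_classify(uint32_t uncor_status, uint32_t uncor_sever)
{
    return (uncor_status & uncor_sever) ? SKETCH_SEV_FATAL
                                        : SKETCH_SEV_NONFATAL;
}
```

Note that the source ID built this way describes the device as the guest
sees it (guest bus number and devfn), which is part of why the host and
guest topology mismatch raised above matters for any bus-level recovery.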