From: Alex Williamson
Date: Wed, 08 Apr 2015 09:36:21 -0600
Subject: Re: [Qemu-devel] [PATCH v5 5/7] vfio-pci: pass the aer error to guest
To: Chen Fan
Cc: izumi.taku@jp.fujitsu.com, qemu-devel@nongnu.org

On Wed, 2015-04-08 at 16:59 +0800, Chen Fan wrote:
> On 04/01/2015 11:46 PM, Alex Williamson wrote:
> > On Wed, 2015-04-01 at 12:12 +0800, Chen Fan wrote:
> >> On 03/25/2015 10:41 AM, Alex Williamson wrote:
> >>> On Wed, 2015-03-25 at 09:53 +0800, Chen Fan wrote:
> >>>> On 03/16/2015 10:09 PM, Alex Williamson wrote:
> >>>>> On Mon, 2015-03-16 at 15:35 +0800, Chen Fan wrote:
> >>>>>> On 03/16/2015 11:52 AM, Alex Williamson wrote:
> >>>>>>> On Mon, 2015-03-16 at 11:05 +0800, Chen Fan wrote:
> >>>>>>>> On 03/14/2015 06:34 AM, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2015-03-12 at 18:23 +0800, Chen Fan wrote:
> >>>>>>>>>> When the vfio device encounters an uncorrectable error on the
> >>>>>>>>>> host, the vfio_pci driver signals the eventfd registered for
> >>>>>>>>>> this vfio device, which results in the qemu eventfd handler
> >>>>>>>>>> being invoked.
> >>>>>>>>>>
> >>>>>>>>>> This patch passes the error to the guest and has the guest
> >>>>>>>>>> driver recover from the error.
> >>>>>>>>> What is going to be the typical recovery mechanism for the guest?
> >>>>>>>>> I'm concerned that the topology of the device in the guest
> >>>>>>>>> doesn't necessarily match the topology of the device in the host,
> >>>>>>>>> so if the guest were to attempt a bus reset to recover a device,
> >>>>>>>>> for instance, what happens?
> >>>>>>>> The recovery mechanism is that when the guest gets an aer error
> >>>>>>>> from a device, it clears the corresponding status bits in the
> >>>>>>>> device's registers, and for a device that needs a reset, the
> >>>>>>>> guest aer driver would reset all devices under the bus.
> >>>>>>> Sorry, I'm still confused, how does the guest aer driver reset all
> >>>>>>> devices under a bus? Are we talking about function-level,
> >>>>>>> device-specific reset mechanisms or secondary bus resets? If the
> >>>>>>> guest is performing secondary bus resets, what guarantee does it
> >>>>>>> have that they will translate to a physical secondary bus reset?
> >>>>>>> vfio may only do an FLR when the bus is reset, or it may not be
> >>>>>>> able to do anything, depending on the available function-level
> >>>>>>> resets and the physical and virtual topology of the device. Thanks,
> >>>>>> In general, recovery depends on the behavior of the corresponding
> >>>>>> device driver, e.g. whether it implements the error_detected and
> >>>>>> slot_reset callbacks. And for a link reset, it usually does a
> >>>>>> secondary bus reset.
> >>>>>>
> >>>>>> And must we require a physical secondary bus reset for a vfio
> >>>>>> device as the bus reset?
> >>>>> That depends on how the guest driver attempts recovery, doesn't it?
> >>>>> There are only a very limited number of cases where a secondary bus
> >>>>> reset initiated by the guest will translate to a secondary bus reset
> >>>>> of the physical device (iirc, a single-function device without FLR).
> >>>>> In most cases, it will at best be translated to an FLR. VFIO really
> >>>>> only does bus resets on VM reset because that's the only time we know
> >>>>> that it's ok to reset multiple devices. If the guest driver is
> >>>>> depending on a secondary bus reset to put the device into a
> >>>>> recoverable state and we're not able to provide that, then we're
> >>>>> actually reducing containment of the error by exposing AER to the
> >>>>> guest and allowing it to attempt recovery. So in practice, I'm afraid
> >>>>> we're risking the integrity of the VM by exposing AER to the guest
> >>>>> and making it think that it can perform recovery operations that are
> >>>>> not effective. Thanks,
> >>>> I have also seen that if a device is without FLR, it seems we can do
> >>>> a hot reset via the VFIO_DEVICE_PCI_HOT_RESET ioctl to reset the
> >>>> physical slot or bus in vfio_pci_reset. Does that address the recovery
> >>>> issues you described?
> >>> The hot reset interface can only be used when a) the user (QEMU) owns
> >>> all of the devices on the bus and b) we know we're resetting all of the
> >>> devices. That mostly limits its use to VM reset. I think that on a
> >>> secondary bus reset, we don't know the scope of the reset at the QEMU
> >>> vfio driver, so we only make use of reset methods with a function-level
> >>> scope. That would only result in a secondary bus reset if that's the
> >>> reset mechanism used by the host kernel's PCI code
> >>> (pci_reset_function), which is limited to single-function devices on a
> >>> secondary bus, with no other reset mechanisms. The hot reset is also
> >>> only available in some configurations; for instance, if we have a
> >>> dual-port NIC where each function is a separate IOMMU group, then we
> >>> clearly cannot do a hot reset unless both functions are assigned to the
> >>> same VM _and_ appear to the guest on the same virtual bus. So even if
> >>> we could know the scope of the reset in the QEMU vfio driver, we can
> >>> only make use of it under very strict guest configurations. Thanks,
> >> Hi Alex,
> >>
> >> Do you have any idea or scenario to fix or work around this issue?
> > Hi Chen,
> >
> > I expect there are two major components to this. The first is that
> > QEMU/vfio-pci needs to enforce that a bus reset is possible for the host
> > and guest topology when guest AER handling is specified for a device.
> > That means that everything affected by the bus reset needs to be exposed
> > to the guest in a compatible way.
> > For instance, if a bus reset affects devices from multiple groups, the
> > guest needs to not only own all of those groups, but they also need to
> > be exposed to the guest such that the virtual bus layout reflects the
> > extent of the reset on the physical bus. This also implies that guest
> > AER handling cannot be the default, since it will impose significant
> > configuration restrictions on device assignment.
> >
> > This seems like a difficult configuration constraint to enforce, but
> > maybe there are simplifying assumptions that can help. For instance,
> > the devices need to be exposed as PCIe, therefore we won't have multiple
> > slots in use on a bus, and I think we can therefore mostly ignore
> > hotplug since we can only hotplug at slot granularity. That may also
> > imply that we should simply enforce a 1:1 mapping of physical functions
> > to virtual functions. At least one function from each group affected by
> > a reset must be exposed to the guest.
> >
> > The second issue is that individual QEMU PCI devices have no callback
> > for a bus reset. QEMU/vfio-pci currently has the DeviceClass.reset
> > callback, which we assume to be a function-level reset. We also register
> > with qemu_register_reset() for a VM reset, which is currently the only
> > point at which we know we can do a reset affecting multiple devices.
> > Infrastructure will need to be added to QEMU/PCI to expose the link
> > down/RST signal to devices on a bus in order to trigger a multi-device
> > reset in vfio-pci.
> >
> > Hopefully I'm not missing something, but I think both of those changes
> > are going to be required before we can have anything remotely
> > supportable for guest-based AER error handling. This is pretty
> > complicated for the user and also for libvirt to figure out. At a
> > minimum, libvirt would need to support a new guest-based AER handling
> > flag for devices. We probably need to determine whether this is unique
> > to vfio-pci or a generic PCIDevice option. Thanks,
>
> Hi Alex,
> Solving these two issues seems like a lot of work. Do we have a simpler
> way to support qemu AER?

Hi Chen,

The simpler way is the existing, containment-only solution where QEMU
stops the guest on an uncorrected error. Do you have any other
suggestions? I don't see how we can rely on guest involvement in
recovery unless the guest has the same ability to reset the device as it
would on bare metal. Thanks,

Alex
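For reference, the error-signaling path discussed in this thread is set up
from userspace by registering an eventfd on the device's error interrupt
index; the host vfio_pci driver then signals that eventfd when the device
reports an uncorrected error. Below is a minimal sketch of that
registration, assuming device_fd is an already open VFIO device file
descriptor; it is an illustrative helper, not the QEMU code itself.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Register an eventfd on the error IRQ index of an open VFIO device fd.
 * Returns the eventfd on success, -1 on failure. */
int register_err_eventfd(int device_fd)
{
    size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
    struct vfio_irq_set *irq_set;
    int efd, ret;

    efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0) {
        return -1;
    }

    irq_set = calloc(1, argsz);
    if (!irq_set) {
        close(efd);
        return -1;
    }

    irq_set->argsz = argsz;
    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set->index = VFIO_PCI_ERR_IRQ_INDEX;  /* device error notification */
    irq_set->start = 0;
    irq_set->count = 1;
    memcpy(irq_set->data, &efd, sizeof(int32_t));

    /* After this, the host signals efd when the device reports an
     * uncorrected error; the existing QEMU handler responds by stopping
     * the guest for containment rather than attempting recovery. */
    ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
    free(irq_set);
    if (ret) {
        close(efd);
        return -1;
    }
    return efd;
}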
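The ownership requirement for the hot reset interface mentioned above is
visible in the ioctl flow: the user first queries which devices and IOMMU
groups a hot reset would affect, and VFIO_DEVICE_PCI_HOT_RESET only
succeeds if the caller supplies open group file descriptors covering every
group in that list. Here is a sketch of the query step only, again
assuming an open device_fd and written as a hypothetical standalone
helper rather than QEMU's implementation.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* List every device (and its IOMMU group) that a PCI hot reset of this
 * device would affect.  Issuing VFIO_DEVICE_PCI_HOT_RESET afterwards is
 * only possible if the caller holds an open group fd for every group_id
 * printed here. */
int print_hot_reset_scope(int device_fd)
{
    struct vfio_pci_hot_reset_info *info, *bigger;
    uint32_t count, i;

    /* First call with no room for entries: it fails with ENOSPC but
     * reports the number of affected devices in info->count. */
    info = calloc(1, sizeof(*info));
    if (!info) {
        return -1;
    }
    info->argsz = sizeof(*info);
    if (ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info) &&
        errno != ENOSPC) {
        free(info);
        return -1;
    }

    /* Second call with room for all reported entries. */
    count = info->count;
    bigger = realloc(info, sizeof(*info) + count * sizeof(info->devices[0]));
    if (!bigger) {
        free(info);
        return -1;
    }
    info = bigger;
    info->argsz = sizeof(*info) + count * sizeof(info->devices[0]);

    if (ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info)) {
        free(info);
        return -1;
    }

    for (i = 0; i < info->count; i++) {
        struct vfio_pci_dependent_device *dev = &info->devices[i];

        printf("group %u: %04x:%02x:%02x.%x\n", dev->group_id, dev->segment,
               dev->bus, dev->devfn >> 3, dev->devfn & 0x7);
    }

    free(info);
    return 0;
}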