From: Alex Williamson
Date: Wed, 08 Apr 2015 09:36:21 -0600
Subject: Re: [Qemu-devel] [PATCH v5 5/7] vfio-pci: pass the aer error to guest
To: Chen Fan
Cc: izumi.taku@jp.fujitsu.com, qemu-devel@nongnu.org

On Wed, 2015-04-08 at 16:59 +0800, Chen Fan wrote:
> On 04/01/2015 11:46 PM, Alex Williamson wrote:
> > On Wed, 2015-04-01 at 12:12 +0800, Chen Fan wrote:
> >> On 03/25/2015 10:41 AM, Alex Williamson wrote:
> >>> On Wed, 2015-03-25 at 09:53 +0800, Chen Fan wrote:
> >>>> On 03/16/2015 10:09 PM, Alex Williamson wrote:
> >>>>> On Mon, 2015-03-16 at 15:35 +0800, Chen Fan wrote:
> >>>>>> On 03/16/2015 11:52 AM, Alex Williamson wrote:
> >>>>>>> On Mon, 2015-03-16 at 11:05 +0800, Chen Fan wrote:
> >>>>>>>> On 03/14/2015 06:34 AM, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2015-03-12 at 18:23 +0800, Chen Fan wrote:
> >>>>>>>>>> When the vfio device encounters an uncorrectable error on the
> >>>>>>>>>> host, the vfio_pci driver signals the eventfd registered for
> >>>>>>>>>> this vfio device, which results in the qemu eventfd handler
> >>>>>>>>>> being invoked.
> >>>>>>>>>>
> >>>>>>>>>> This patch passes the error to the guest and has the guest
> >>>>>>>>>> driver recover from the error.
> >>>>>>>>> What is going to be the typical recovery mechanism for the guest?
> >>>>>>>>> I'm concerned that the topology of the device in the guest
> >>>>>>>>> doesn't necessarily match the topology of the device in the host,
> >>>>>>>>> so if the guest were to attempt a bus reset to recover a device,
> >>>>>>>>> for instance, what happens?
> >>>>>>>> The recovery mechanism is that when the guest gets an aer error
> >>>>>>>> from a device, it clears the corresponding status bits in the
> >>>>>>>> device's registers, and for a device that needs a reset, the
> >>>>>>>> guest aer driver would reset all devices under the bus.
> >>>>>>> Sorry, I'm still confused, how does the guest aer driver reset all
> >>>>>>> devices under a bus? Are we talking about function-level,
> >>>>>>> device-specific reset mechanisms or secondary bus resets? If the
> >>>>>>> guest is performing secondary bus resets, what guarantee does it
> >>>>>>> have that they will translate to a physical secondary bus reset?
> >>>>>>> vfio may only do an FLR when the bus is reset, or it may not be
> >>>>>>> able to do anything, depending on the available function-level
> >>>>>>> resets and the physical and virtual topology of the device. Thanks,
> >>>>>> In general, recovery depends on the behavior of the corresponding
> >>>>>> device driver, e.g. whether it implements the error_detected and
> >>>>>> slot_reset callbacks. And for a link reset, it usually does a
> >>>>>> secondary bus reset.
> >>>>>>
> >>>>>> And must we require a physical secondary bus reset for a vfio
> >>>>>> device as the bus reset?
> >>>>> That depends on how the guest driver attempts recovery, doesn't it?
> >>>>> There are only a very limited number of cases where a secondary bus
> >>>>> reset initiated by the guest will translate to a secondary bus reset
> >>>>> of the physical device (iirc, a single-function device without FLR).
> >>>>> In most cases, it will at best be translated to an FLR. VFIO really
> >>>>> only does bus resets on VM reset because that's the only time we know
> >>>>> that it's ok to reset multiple devices. If the guest driver is
> >>>>> depending on a secondary bus reset to put the device into a
> >>>>> recoverable state and we're not able to provide that, then we're
> >>>>> actually reducing containment of the error by exposing AER to the
> >>>>> guest and allowing it to attempt recovery. So in practice, I'm afraid
> >>>>> we're risking the integrity of the VM by exposing AER to the guest
> >>>>> and making it think that it can perform recovery operations that are
> >>>>> not effective. Thanks,
> >>>> I have also seen that if a device is without FLR, it seems we can do
> >>>> a hot reset via the VFIO_DEVICE_PCI_HOT_RESET ioctl to reset the
> >>>> physical slot or bus in vfio_pci_reset. Does that address the recovery
> >>>> issues you described?
> >>> The hot reset interface can only be used when a) the user (QEMU) owns
> >>> all of the devices on the bus and b) we know we're resetting all of the
> >>> devices. That mostly limits its use to VM reset. I think that on a
> >>> secondary bus reset, we don't know the scope of the reset at the QEMU
> >>> vfio driver, so we only make use of reset methods with a function-level
> >>> scope. That would only result in a secondary bus reset if that's the
> >>> reset mechanism used by the host kernel's PCI code
> >>> (pci_reset_function), which is limited to single-function devices on a
> >>> secondary bus, with no other reset mechanisms. The hot reset is also
> >>> only available in some configurations; for instance, if we have a
> >>> dual-port NIC where each function is a separate IOMMU group, then we
> >>> clearly cannot do a hot reset unless both functions are assigned to the
> >>> same VM _and_ appear to the guest on the same virtual bus. So even if
> >>> we could know the scope of the reset in the QEMU vfio driver, we can
> >>> only make use of it under very strict guest configurations. Thanks,
> >> Hi Alex,
> >>
> >> Do you have any idea or scenario to fix or work around this issue?
> > Hi Chen,
> >
> > I expect there are two major components to this. The first is that
> > QEMU/vfio-pci needs to enforce that a bus reset is possible for the host
> > and guest topology when guest AER handling is specified for a device.
> > That means that everything affected by the bus reset needs to be exposed
> > to the guest in a compatible way.
> > For instance, if a bus reset affects devices from multiple groups, the
> > guest needs to not only own all of those groups, but they also need to
> > be exposed to the guest such that the virtual bus layout reflects the
> > extent of the reset on the physical bus. This also implies that guest
> > AER handling cannot be the default, since it will impose significant
> > configuration restrictions on device assignment.
> >
> > This seems like a difficult configuration constraint to enforce, but
> > maybe there are simplifying assumptions that can help. For instance,
> > the devices need to be exposed as PCIe, therefore we won't have multiple
> > slots in use on a bus, and I think we can therefore mostly ignore
> > hotplug since we can only hotplug at slot granularity. That may also
> > imply that we should simply enforce a 1:1 mapping of physical functions
> > to virtual functions. At least one function from each group affected by
> > a reset must be exposed to the guest.
> >
> > The second issue is that individual QEMU PCI devices have no callback
> > for a bus reset. QEMU/vfio-pci currently has the DeviceClass.reset
> > callback, which we assume to be a function-level reset. We also register
> > with qemu_register_reset() for a VM reset, which is currently the only
> > point at which we know we can do a reset affecting multiple devices.
> > Infrastructure will need to be added to QEMU/PCI to expose the link
> > down/RST signal to devices on a bus in order to trigger a multi-device
> > reset in vfio-pci.
> >
> > Hopefully I'm not missing something, but I think both of those changes
> > are going to be required before we can have anything remotely
> > supportable for guest-based AER error handling. This is pretty
> > complicated for the user and also for libvirt to figure out. At a
> > minimum, libvirt would need to support a new guest-based AER handling
> > flag for devices. We probably need to determine whether this is unique
> > to vfio-pci or a generic PCIDevice option. Thanks,
>
> Hi Alex,
> Solving these two issues seems like a lot of work. Do we have a simpler
> way to support qemu AER?

Hi Chen,

The simpler way is the existing, containment-only solution where QEMU
stops the guest on an uncorrected error. Do you have any other
suggestions? I don't see how we can rely on guest involvement in
recovery unless the guest has the same ability to reset the device as it
would on bare metal. Thanks,

Alex
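For reference, the error-signaling path discussed in this thread is set up
from userspace by registering an eventfd on the device's error interrupt
index; the host vfio_pci driver then signals that eventfd when the device
reports an uncorrected error. Below is a minimal sketch of that
registration, assuming device_fd is an already open VFIO device file
descriptor; it is an illustrative helper, not the QEMU code itself.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Register an eventfd on the error IRQ index of an open VFIO device fd.
 * Returns the eventfd on success, -1 on failure. */
int register_err_eventfd(int device_fd)
{
    size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
    struct vfio_irq_set *irq_set;
    int efd, ret;

    efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0) {
        return -1;
    }

    irq_set = calloc(1, argsz);
    if (!irq_set) {
        close(efd);
        return -1;
    }

    irq_set->argsz = argsz;
    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set->index = VFIO_PCI_ERR_IRQ_INDEX;  /* device error notification */
    irq_set->start = 0;
    irq_set->count = 1;
    memcpy(irq_set->data, &efd, sizeof(int32_t));

    /* After this, the host signals efd when the device reports an
     * uncorrected error; the existing QEMU handler responds by stopping
     * the guest for containment rather than attempting recovery. */
    ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
    free(irq_set);
    if (ret) {
        close(efd);
        return -1;
    }
    return efd;
}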
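The ownership requirement for the hot reset interface mentioned above is
visible in the ioctl flow: the user first queries which devices and IOMMU
groups a hot reset would affect, and VFIO_DEVICE_PCI_HOT_RESET only
succeeds if the caller supplies open group file descriptors covering every
group in that list. Here is a sketch of the query step only, again
assuming an open device_fd and written as a hypothetical standalone
helper rather than QEMU's implementation.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* List every device (and its IOMMU group) that a PCI hot reset of this
 * device would affect.  Issuing VFIO_DEVICE_PCI_HOT_RESET afterwards is
 * only possible if the caller holds an open group fd for every group_id
 * printed here. */
int print_hot_reset_scope(int device_fd)
{
    struct vfio_pci_hot_reset_info *info, *bigger;
    uint32_t count, i;

    /* First call with no room for entries: it fails with ENOSPC but
     * reports the number of affected devices in info->count. */
    info = calloc(1, sizeof(*info));
    if (!info) {
        return -1;
    }
    info->argsz = sizeof(*info);
    if (ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info) &&
        errno != ENOSPC) {
        free(info);
        return -1;
    }

    /* Second call with room for all reported entries. */
    count = info->count;
    bigger = realloc(info, sizeof(*info) + count * sizeof(info->devices[0]));
    if (!bigger) {
        free(info);
        return -1;
    }
    info = bigger;
    info->argsz = sizeof(*info) + count * sizeof(info->devices[0]);

    if (ioctl(device_fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info)) {
        free(info);
        return -1;
    }

    for (i = 0; i < info->count; i++) {
        struct vfio_pci_dependent_device *dev = &info->devices[i];

        printf("group %u: %04x:%02x:%02x.%x\n", dev->group_id, dev->segment,
               dev->bus, dev->devfn >> 3, dev->devfn & 0x7);
    }

    free(info);
    return 0;
}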