From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:49306)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1b0acX-0007OK-DD
	for qemu-devel@nongnu.org; Wed, 11 May 2016 16:20:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1b0acS-0006b8-Pu
	for qemu-devel@nongnu.org; Wed, 11 May 2016 16:20:37 -0400
Received: from mx1.redhat.com ([209.132.183.28]:44379)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1b0acS-0006b3-Dg
	for qemu-devel@nongnu.org; Wed, 11 May 2016 16:20:32 -0400
Date: Wed, 11 May 2016 14:20:28 -0600
From: Alex Williamson <alex.williamson@redhat.com>
Message-ID: <20160511142028.2f6f511d@t450s.home>
In-Reply-To: <f006881e-70f7-7745-040a-ba3bbdfa6663@cn.fujitsu.com>
References: <1459856523-17085-1-git-send-email-caoj.fnst@cn.fujitsu.com>
	<1459856523-17085-12-git-send-email-caoj.fnst@cn.fujitsu.com>
	<20160411153827.3884ded1@t450s.home>
	<570EEC42.3040300@cn.fujitsu.com> <571EE2D6.4000100@cn.fujitsu.com>
	<20160426084815.24ec5200@t450s.home>
	<572BF5A1.3090100@cn.fujitsu.com>
	<20160506103932.7d1df9aa@t450s.home>
	<f006881e-70f7-7745-040a-ba3bbdfa6663@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [patch v6 11/12] vfio: register aer resume
 notification handler for aer resume
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Zhou Jie <zhoujie2011@cn.fujitsu.com>
Cc: Chen Fan <chen.fan.fnst@cn.fujitsu.com>, izumi.taku@jp.fujitsu.com, Cao jin <caoj.fnst@cn.fujitsu.com>, qemu-devel@nongnu.org, mst@redhat.com

On Wed, 11 May 2016 11:11:39 +0800
Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:

> Hi, Alex
>      What do you think about the following solution?
>      1. Detect support for resume notification.
>         If host vfio driver does not have resume notifier flags,
>         Directly fail to boot up VM as with aer enabled.

It's not a flag used to detect the resume notifier, but simply probing
IRQ_INFO for the index allocated for this notification.

>      2. Immediately notify the VM on error detected.
>      3. Stall any access to the device until resume is signaled.
>         Disable mmaps, drop writes, return -1 for reads.
>      4. Delay the guest directed bus reset.
>         Don't reset bus in vfio_pci_reset function.
>      5. Wait for resume notification.
>         If we don't get the resume notification from the host after
>         some timeout, we would abort the guest directed bus reset
>         altogether and make the device disappear,
>         Initiating an unplug of the device to prevent it from further
>         interacting with the VM.
>      6. After get the resume notification.
>         Reset bus.
>         It the second bus reset. Because the host did bus reset already.
>         But as you said we shouldn't necessarily design the API that
>         strictly around the current behavior of the Linux AER handler.

Otherwise it sounds like what I had proposed.  Thanks,

Alex

> On 2016/5/7 0:39, Alex Williamson wrote:
> > On Fri, 6 May 2016 09:38:41 +0800
> > Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
> >  
> >> On 04/26/2016 10:48 PM, Alex Williamson wrote:  
> >>> On Tue, 26 Apr 2016 11:39:02 +0800
> >>> Chen Fan<chen.fan.fnst@cn.fujitsu.com>  wrote:
> >>>  
> >>>> On 04/14/2016 09:02 AM, Chen Fan wrote:  
> >>>>> On 04/12/2016 05:38 AM, Alex Williamson wrote:  
> >>>>>> On Tue, 5 Apr 2016 19:42:02 +0800
> >>>>>> Cao jin<caoj.fnst@cn.fujitsu.com>  wrote:
> >>>>>>  
> >>>>>>> From: Chen Fan<chen.fan.fnst@cn.fujitsu.com>
> >>>>>>>
> >>>>>>> for supporting aer recovery, host and guest would run the same aer
> >>>>>>> recovery code, that would do the secondary bus reset if the error
> >>>>>>> is fatal, the aer recovery process:
> >>>>>>>     1. error_detected
> >>>>>>>     2. reset_link (if fatal)
> >>>>>>>     3. slot_reset/mmio_enabled
> >>>>>>>     4. resume
> >>>>>>>
> >>>>>>> it indicates that host will do secondary bus reset to reset
> >>>>>>> the physical devices under bus in step 2, that would cause
> >>>>>>> devices in D3 status in a short time. but in qemu, we register
> >>>>>>> an error detected handler, that would be invoked as host broadcasts
> >>>>>>> the error-detected event in step 1, in order to avoid guest do
> >>>>>>> reset_link when host do reset_link simultaneously. it may cause
> >>>>>>> fatal error. we introduce a resmue notifier to assure host reset
> >>>>>>> completely. then do guest aer injection.  
> >>>>>> Why is it safe to continue running the VM between the error detected
> >>>>>> notification and the resume notification?  We're just pushing back the
> >>>>>> point at which we inject the AER into the guest, potentially negating
> >>>>>> any benefit by allowing the VM to consume bad data.  Shouldn't we
> >>>>>> instead be immediately notifying the VM on error detected, but stalling
> >>>>>> any access to the device until resume is signaled?  How do we know that
> >>>>>> resume will ever be signaled?  We have both the problem that we may be
> >>>>>> running on an older kernel that won't support a resume notification and
> >>>>>> the problem that seeing a resume notification depends on the host being
> >>>>>> able to successfully complete a link reset after fatal error. We can
> >>>>>> detect support for resume notification, but we still need a strategy
> >>>>>> for never receiving it.  Thanks,  
> >>>>> That's make sense, but I haven't came up with a good idea. do you have
> >>>>> any idea, Alex?  
> >>> I don't know that there are any good solutions here.  We need to
> >>> respond to the current error notifier interrupt and not regress from
> >>> our support there.  I think that means that if we want to switch from a
> >>> simple halt-on-error to a mechanism for the guest to handle recovery,
> >>> we need to disable access to the device between being notified that the
> >>> error occurred and being notified to resume.  We can do that by
> >>> disabling mmaps to the device and preventing access via the slow path
> >>> handlers.  I don't know what the best solution is for preventing access,
> >>> do we block and pause the VM or do we drop writes and return -1 for
> >>> reads, that's something that needs to be determined.  We also need to
> >>> inject the AER into the VM at the point we're notified of an error
> >>> because the VM needs to know as soon as possible to stop using the
> >>> device or trusting any data from it.  The next coordination point would
> >>> be something like the resume notifier that you've added and there are
> >>> numerous questions around the interaction of that with the guest
> >>> handling.  Clearly we can't do a guest directed bus reset until we get
> >>> the resume notifier, so do we block that execution path in QEMU until
> >>> the resume notification is received?  What happens if we don't get that
> >>> notification?  Is there any way that we can rely on the host having
> >>> done a bus reset to the point where we don't need to act on the guest
> >>> directed reset?  These are all things that need to be figured out.
> >>> Thanks,  
> >> Maybe we can simply pause the vcpu running and avoid the VM to
> >> access the device. and add two flags in VFIO_DEVICE_GET_INFO to query
> >> whether the vfio pci driver has a resume notifier,
> >> if it does not have resume notifier flags, we can directly fail to boot
> >> up VM
> >> as with aer enabled.  
> >
> > We can already tell if a resume interrupt is supported between the IRQ
> > count in vfio_device_info and a probe with vfio_irq_info, what would
> > additional flags in vfio_device_info tell us beyond a resume interrupt
> > being supported?  Is pausing the VM acceptable from a service guarantee
> > perspective to users?  A bus reset can take a full second and I imagine
> > deeper PCI hierarchies can push that out depending on what level the
> > error occurs.  A second of downtime may be enough to trigger failovers
> > to other systems.  If we were to disable mmaps when a fault occurs, we
> > could trap any further device access, drop writes, return -1 for
> > reads.  This seems reasonable since we've already notified the VM that
> > the device had a fault.  The synchronization point seems like when the
> > guest tries to do a bus reset, we need to block that until we get the
> > resume notification from the host.  Perhaps if that doesn't occur after
> > some timeout, we would abort the guest directed bus reset altogether
> > and make the device disappear, perhaps even initiating an unplug of the
> > device to prevent it from further interacting with the VM.
> >  
> >> otherwise, we should wait for resume notifier coming to
> >> restart the cpu. about the problem of the reduplicated bus reset by host
> >> and guest,
> >> I think qemu can according to the error is fatal or non-fatal to decide
> >> whether need
> >> to do a bus reset on guest, I think it's not critical and could be
> >> resolved later.  
> >
> > The vfio error interrupt doesn't signal non-fatal errors afaik.  I'm
> > also not sure we have an guarantee that the host has performed a bus
> > reset, we shouldn't necessarily design the API that strictly around the
> > current behavior of the Linux AER handler.  So I don't know that
> > there's any practical way to avoid duplicate bus resets between host
> > and guest recovery.  Thanks,
> >
> > Alex
> >
> >
> > .
> >  
> 
> 
>