Date: Wed, 11 May 2016 14:20:28 -0600
From: Alex Williamson
Message-ID: <20160511142028.2f6f511d@t450s.home>
References: <1459856523-17085-1-git-send-email-caoj.fnst@cn.fujitsu.com>
 <1459856523-17085-12-git-send-email-caoj.fnst@cn.fujitsu.com>
 <20160411153827.3884ded1@t450s.home>
 <570EEC42.3040300@cn.fujitsu.com>
 <571EE2D6.4000100@cn.fujitsu.com>
 <20160426084815.24ec5200@t450s.home>
 <572BF5A1.3090100@cn.fujitsu.com>
 <20160506103932.7d1df9aa@t450s.home>
Subject: Re: [Qemu-devel] [patch v6 11/12] vfio: register aer resume notification handler for aer resume
To: Zhou Jie
Cc: Chen Fan, izumi.taku@jp.fujitsu.com, Cao jin, qemu-devel@nongnu.org, mst@redhat.com

On Wed, 11 May 2016 11:11:39 +0800
Zhou Jie wrote:

> Hi, Alex
> What do you think about the following solution?
> 1. Detect support for resume notification.
>    If host vfio driver does not have resume notifier flags,
>    Directly fail to boot up VM as with aer enabled.

It's not a flag used to detect the resume notifier, but simply probing
IRQ_INFO for the index allocated for this notification.

> 2. Immediately notify the VM on error detected.
> 3. Stall any access to the device until resume is signaled.
>    Disable mmaps, drop writes, return -1 for reads.
> 4. Delay the guest directed bus reset.
>    Don't reset bus in vfio_pci_reset function.
> 5. Wait for resume notification.
>    If we don't get the resume notification from the host after
>    some timeout, we would abort the guest directed bus reset
>    altogether and make the device disappear,
>    initiating an unplug of the device to prevent it from further
>    interacting with the VM.
> 6. After getting the resume notification.
>    Reset bus.
>    It is the second bus reset, because the host did a bus reset already.
>    But as you said we shouldn't necessarily design the API that
>    strictly around the current behavior of the Linux AER handler.

Otherwise it sounds like what I had proposed. Thanks,

Alex

> On 2016/5/7 0:39, Alex Williamson wrote:
> > On Fri, 6 May 2016 09:38:41 +0800
> > Chen Fan wrote:
> >
> >> On 04/26/2016 10:48 PM, Alex Williamson wrote:
> >>> On Tue, 26 Apr 2016 11:39:02 +0800
> >>> Chen Fan wrote:
> >>>
> >>>> On 04/14/2016 09:02 AM, Chen Fan wrote:
> >>>>> On 04/12/2016 05:38 AM, Alex Williamson wrote:
> >>>>>> On Tue, 5 Apr 2016 19:42:02 +0800
> >>>>>> Cao jin wrote:
> >>>>>>
> >>>>>>> From: Chen Fan
> >>>>>>>
> >>>>>>> for supporting aer recovery, host and guest would run the same aer
> >>>>>>> recovery code, that would do the secondary bus reset if the error
> >>>>>>> is fatal, the aer recovery process:
> >>>>>>> 1. error_detected
> >>>>>>> 2. reset_link (if fatal)
> >>>>>>> 3. slot_reset/mmio_enabled
> >>>>>>> 4. resume
> >>>>>>>
> >>>>>>> it indicates that host will do secondary bus reset to reset
> >>>>>>> the physical devices under bus in step 2, that would cause
> >>>>>>> devices in D3 status in a short time. but in qemu, we register
> >>>>>>> an error detected handler, that would be invoked as host broadcasts
> >>>>>>> the error-detected event in step 1, in order to avoid guest do
> >>>>>>> reset_link when host do reset_link simultaneously. it may cause
> >>>>>>> fatal error.
> >>>>>>> we introduce a resume notifier to assure host reset
> >>>>>>> completely. then do guest aer injection.
> >>>>>> Why is it safe to continue running the VM between the error detected
> >>>>>> notification and the resume notification? We're just pushing back the
> >>>>>> point at which we inject the AER into the guest, potentially negating
> >>>>>> any benefit by allowing the VM to consume bad data. Shouldn't we
> >>>>>> instead be immediately notifying the VM on error detected, but stalling
> >>>>>> any access to the device until resume is signaled? How do we know that
> >>>>>> resume will ever be signaled? We have both the problem that we may be
> >>>>>> running on an older kernel that won't support a resume notification and
> >>>>>> the problem that seeing a resume notification depends on the host being
> >>>>>> able to successfully complete a link reset after fatal error. We can
> >>>>>> detect support for resume notification, but we still need a strategy
> >>>>>> for never receiving it. Thanks,
> >>>>> That makes sense, but I haven't come up with a good idea. do you have
> >>>>> any idea, Alex?
> >>> I don't know that there are any good solutions here. We need to
> >>> respond to the current error notifier interrupt and not regress from
> >>> our support there. I think that means that if we want to switch from a
> >>> simple halt-on-error to a mechanism for the guest to handle recovery,
> >>> we need to disable access to the device between being notified that the
> >>> error occurred and being notified to resume. We can do that by
> >>> disabling mmaps to the device and preventing access via the slow path
> >>> handlers. I don't know what the best solution is for preventing access,
> >>> do we block and pause the VM or do we drop writes and return -1 for
> >>> reads, that's something that needs to be determined.
> >>> We also need to
> >>> inject the AER into the VM at the point we're notified of an error
> >>> because the VM needs to know as soon as possible to stop using the
> >>> device or trusting any data from it. The next coordination point would
> >>> be something like the resume notifier that you've added and there are
> >>> numerous questions around the interaction of that with the guest
> >>> handling. Clearly we can't do a guest directed bus reset until we get
> >>> the resume notifier, so do we block that execution path in QEMU until
> >>> the resume notification is received? What happens if we don't get that
> >>> notification? Is there any way that we can rely on the host having
> >>> done a bus reset to the point where we don't need to act on the guest
> >>> directed reset? These are all things that need to be figured out.
> >>> Thanks,
> >> Maybe we can simply pause the vcpu running and avoid the VM to
> >> access the device. and add two flags in VFIO_DEVICE_GET_INFO to query
> >> whether the vfio pci driver has a resume notifier,
> >> if it does not have resume notifier flags, we can directly fail to boot
> >> up VM as with aer enabled.
> >
> > We can already tell if a resume interrupt is supported between the IRQ
> > count in vfio_device_info and a probe with vfio_irq_info, what would
> > additional flags in vfio_device_info tell us beyond a resume interrupt
> > being supported? Is pausing the VM acceptable from a service guarantee
> > perspective to users? A bus reset can take a full second and I imagine
> > deeper PCI hierarchies can push that out depending on what level the
> > error occurs. A second of downtime may be enough to trigger failovers
> > to other systems. If we were to disable mmaps when a fault occurs, we
> > could trap any further device access, drop writes, return -1 for
> > reads. This seems reasonable since we've already notified the VM that
> > the device had a fault.
> > The synchronization point seems like when the
> > guest tries to do a bus reset, we need to block that until we get the
> > resume notification from the host. Perhaps if that doesn't occur after
> > some timeout, we would abort the guest directed bus reset altogether
> > and make the device disappear, perhaps even initiating an unplug of the
> > device to prevent it from further interacting with the VM.
> >
> >> otherwise, we should wait for resume notifier coming to
> >> restart the cpu. about the problem of the duplicated bus reset by host
> >> and guest, I think qemu can decide, according to whether the error is
> >> fatal or non-fatal, whether it needs to do a bus reset on guest.
> >> I think it's not critical and could be resolved later.
> >
> > The vfio error interrupt doesn't signal non-fatal errors afaik. I'm
> > also not sure we have any guarantee that the host has performed a bus
> > reset, we shouldn't necessarily design the API that strictly around the
> > current behavior of the Linux AER handler. So I don't know that
> > there's any practical way to avoid duplicate bus resets between host
> > and guest recovery. Thanks,
> >
> > Alex
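
The stall, deferred-reset and timeout steps discussed in this thread can
be sketched as a small state machine. This is only an illustrative
sketch under assumptions, not code from QEMU, VFIO or the patch series:
every name below (aer_dev, aer_on_error, aer_guest_reset, ...) is
hypothetical, a single uint32_t stands in for the device's registers,
and the timeout is modeled as an explicit call rather than a real timer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: none of these names are QEMU or VFIO API. */
enum aer_state { AER_NORMAL, AER_STALLED, AER_UNPLUGGED };

struct aer_dev {
    enum aer_state state;
    bool reset_pending;   /* guest requested a bus reset while stalled */
    unsigned resets_done; /* guest-directed bus resets actually issued */
    uint32_t reg;         /* stand-in for one device register */
};

/* Host signalled error_detected: stall device access from here on.
 * (Injecting the AER into the guest would also happen at this point.) */
static void aer_on_error(struct aer_dev *d)
{
    if (d->state == AER_NORMAL) {
        d->state = AER_STALLED;
    }
}

/* Reads return all ones (-1) unless the device is in normal operation. */
static uint32_t aer_read(const struct aer_dev *d)
{
    return d->state == AER_NORMAL ? d->reg : UINT32_MAX;
}

/* Writes are silently dropped unless the device is in normal operation. */
static void aer_write(struct aer_dev *d, uint32_t val)
{
    if (d->state == AER_NORMAL) {
        d->reg = val;
    }
}

/* Guest-directed bus reset: done at once when normal, deferred when stalled. */
static void aer_guest_reset(struct aer_dev *d)
{
    if (d->state == AER_NORMAL) {
        d->reg = 0;
        d->resets_done++;
    } else if (d->state == AER_STALLED) {
        d->reset_pending = true;
    }
}

/* Host signalled resume: reopen access and replay a deferred reset. */
static void aer_on_resume(struct aer_dev *d)
{
    if (d->state != AER_STALLED) {
        return;
    }
    d->state = AER_NORMAL;
    if (d->reset_pending) {
        d->reset_pending = false;
        d->reg = 0;
        d->resets_done++;
    }
}

/* No resume arrived in time: abandon the reset and drop the device. */
static void aer_on_timeout(struct aer_dev *d)
{
    if (d->state == AER_STALLED) {
        d->state = AER_UNPLUGGED;
        d->reset_pending = false;
    }
}
```

In this model the guest-directed reset after resume is still a second
reset on top of whatever the host did, which matches the point made in
the thread that duplicate resets may not be practically avoidable.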