Re: [PATCH] vfio/pci: Support error recovery

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Cao jin <caoj.fnst@cn.fujitsu.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	izumi.taku@jp.fujitsu.com
Subject: Re: [PATCH] vfio/pci: Support error recovery
Date: Wed, 14 Dec 2016 15:49:23 -0700	[thread overview]
Message-ID: <20161214154923.5a4e7c90@t450s.home> (raw)
In-Reply-To: <20161215002444-mutt-send-email-mst@kernel.org>

On Thu, 15 Dec 2016 00:25:13 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Dec 14, 2016 at 03:16:37PM -0700, Alex Williamson wrote:
> > On Wed, 14 Dec 2016 18:24:23 +0800
> > Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >   
> > > Sorry for late.
> > > after reading all your comments, I think I will try the solution 1.
> > > 
> > > On 12/13/2016 03:12 AM, Alex Williamson wrote:  
> > > > On Mon, 12 Dec 2016 21:49:01 +0800
> > > > Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> > > >     
> > > >> Hi,
> > > >> I have 2 solutions(high level design) came to me, please see if they are
> > > >> acceptable, or which one is acceptable. Also have some questions.
> > > >>
> > > >> 1. block guest access during host recovery
> > > >>
> > > >>    add new field error_recovering in struct vfio_pci_device to
> > > >>    indicate host recovery status. aer driver in host will still do
> > > >>    reset link
> > > >>
> > > >>    - set error_recovering in vfio-pci driver's error_detected, used to
> > > >>      block all kinds of user access(config space, mmio)
> > > >>    - in order to solve concurrent issue of device resetting & user
> > > >>      access, check device state[*] in vfio-pci driver's resume, see if
> > > >>      device reset is done, if it is, then clear"error_recovering", or
> > > >>      else new a timer, check device state periodically until device
> > > >>      reset is done. (what if device reset don't end for a long time?)
> > > >>    - In qemu, translate guest link reset to host link reset.
> > > >>      A question here: we already have link reset in host, is a second
> > > >>      link reset necessary? why?
> > > >>  
> > > >>    [*] how to check device state: reading certain config space
> > > >>        register, check return value is valid or not(All F's)    
> > > > 
> > > > Isn't this exactly the path we were on previously?    
> > > 
> > > Yes, it is basically the previous path, plus the optimization.
> > >   
> > > > There might be an
> > > > optimization that we could skip back-to-back resets, but how can you
> > > > necessarily infer that the resets are for the same thing? If the user
> > > > accesses the device between resets, can you still guarantee the guest
> > > > directed reset is unnecessary?  If time passes between resets, do you
> > > > know they're for the same event?  How much time can pass between the
> > > > host and guest reset to know they're for the same event?  In the
> > > > process of error handling, which is more important, speed or
> > > > correctness?
> > > >      
> > > 
> > > I think vfio driver itself won't know what each reset comes for, and I
> > > don't quite understand why should vfio care this question, is this a new
> > > question in the design?  
> > 
> > You're suggesting an optimization to eliminate one of the resets,
> > and as we've discussed, I don't see removing the host induced reset
> > as a viable option.  That means you want to eliminate the guest
> > directed reset.  There are potentially three levels to do that, the
> > vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
> > within the guest.  My comments were directed to the first option, the
> > host kernel level cannot correlate user directed resets as duplicates
> > of host directed resets.  
> >    
> > > But I think it make sense that the user access during 2 resets maybe a
> > > trouble for guest recovery, misbehaved user could be out of our
> > > imagination.  Correctness is more important.
> > > 
> > > If I understand you right, let me make a summary: host recovery just
> > > does link reset, which is incomplete, so we'd better do a complete guest
> > > recovery for correctness.  
> > 
> > We don't know whether the host link reset is incomplete, but we can't do
> > a link reset transparently to the device, the device is no longer in the
> > same state after the reset.  The device specific driver, which exists
> > in userspace needs to be involved in device recovery.  Therefore
> > regardless of how QEMU handles the error, the driver within the guest
> > needs to be notified and perform recovery.  Since the device is PCI and
> > we're on x86 and nobody wants to introduce paravirtual error recovery,
> > we must use AER.  Part of AER recovery includes the possibility of
> > performing a link reset.  So it seems this eliminates avoiding the link
> > reset within the guest.
> > 
> > That leaves QEMU.  Here we need to decide whether a guest triggered
> > link reset induces a host link reset.  The current working theory is
> > that yes, this must be the case.  If there is ever a case where a
> > driver within the guest could trigger a link reset for the purposes
> > of error recovery when the host has not, I think this must be the
> > case.  Therefore, at least some guest induced link resets must become
> > host link resets.  Currently we assume all guest induced link resets
> > become host link resets.  Minimally to avoid that, QEMU would need to
> > know (not assume) whether the host performed a link reset.  Even with
> > that, QEMU would need to be able to correlate that a link reset from
> > the guest is a duplicate of a link reset that was already performed by
> > the host.  That implies that QEMU needs to deduce the intention of
> > the guest.  That seems like a complicated task for a patch series that
> > is already complicated enough, especially for a feature of questionable
> > value given the configuration restrictions (imo).
> > 
> > I would much rather focus on getting it right and making it as simple
> > as we can, even if that means links get reset one too many times on
> > error.
> >   
> > > >> 2. skip link reset in aer driver of host kernel, for vfio-pci.
> > > >>    Let user decide how to do serious recovery
> > > >>
> > > >>    add new field "user_driver" in struct pci_dev, used to skip link
> > > >>    reset for vfio-pci; add new field "link_reset" in struct
> > > >>    vfio_pci_device to indicate link has been reset or not during
> > > >>    recovery
> > > >>
> > > >>    - set user_driver in vfio_pci_probe(), to skip link reset for
> > > >>      vfio-pci in host.
> > > >>    - (use a flag)block user access(config, mmio) during host recovery
> > > >>      (not sure if this step is necessary)
> > > >>    - In qemu, translate guest link reset to host link reset.
> > > >>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
> > > >>      is executed
> > > >>    - In vfio-pci driver's resume, new a timer, check "link_reset" field
> > > >>      periodically, if it is set in reasonable time, then clear it and
> > > >>      delete timer, or else, vfio-pci driver will does the link reset!    
> > > > 
> > > > What happens in the case of a multifunction device where each function
> > > > is part of a separate IOMMU group and one function is hot-removed from
> > > > the user? We can't do a link reset on that function since the other
> > > > function is still in use.  We have no choice but release a device in an
> > > > unknown state back to the host.    
> > > 
> > > hot-remove from user, do you mean, for example, all functions assigned
> > > to VM, then suddenly a person does something like following
> > > 
> > > $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
> > > 
> > > $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
> > > 
> > > to return device to host driver, or don't bind it to host driver, let it
> > > in driver-less state???  
> > 
> > Yes, the host kernel has no visiblity to how a user is making use of
> > devices.  To support AER we require a similar topology between host and
> > guest such that a guest link reset translates to a host reset.  That
> > requirement is imposed by userspace, ie. QEMU.  The host kernel cannot
> > presume that this is the case.  
> 
> So enforce this to enable recovery functionality. Why can't you?

How?
 
> >  Therefore we could have a
> > multi-function device where each function is assigned to the same or
> > different users in any configuration.  If a fault occurs and we defer
> > to the user to perform the link reset, we have absolutely no guarantee
> > that it will ever occur.  If the functions are assigned to different
> > users, then each user individually doesn't have the capability to
> > perform a link reset.  If the devices happen to be assigned to a single
> > user when the error occurs, we cannot assume the user has an AER
> > compatible configuration, the devices could be exposed as separate
> > single function devices, any one of which might be individually removed
> > from the user and made use of by the host, such as your sysfs example
> > above.  The host cannot perform a link reset in this case either
> > as the sibling devices are still in use by the guest.  Thanks,
> > 
> > Alex

next prev parent reply	other threads:[~2016-12-14 22:50 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-27 11:34 [PATCH] vfio/pci: Support error recovery Cao jin
2016-11-28  3:00 ` Michael S. Tsirkin
2016-11-28  9:32   ` Cao jin
2016-11-30  1:46     ` Michael S. Tsirkin
2016-12-01 13:38       ` Cao jin
2016-12-01  4:04 ` Alex Williamson
2016-12-01  4:51   ` Michael S. Tsirkin
2016-12-01 13:40     ` Cao jin
2016-12-06  3:46       ` Michael S. Tsirkin
2016-12-06  6:47         ` Cao jin
2016-12-01 13:40   ` Cao jin
2016-12-01 14:55     ` Alex Williamson
2016-12-04 12:16       ` Cao jin
2016-12-04 15:30         ` Alex Williamson
2016-12-05  5:52           ` Cao jin
2016-12-05 16:17             ` Alex Williamson
2016-12-06  3:55               ` Michael S. Tsirkin
2016-12-06  4:59                 ` Alex Williamson
2016-12-06 10:46                   ` Cao jin
2016-12-06 15:35                     ` Alex Williamson
2016-12-07  2:49                       ` Cao jin
2016-12-08 14:46                       ` Cao jin
2016-12-08 16:30                         ` Michael S. Tsirkin
2016-12-09  3:40                           ` Cao jin
2016-12-09  3:40                         ` Cao jin
2016-12-06  6:11               ` Cao jin
2016-12-06 15:25                 ` Alex Williamson
2016-12-07  2:58                   ` Cao jin
2016-12-12 13:49 ` Cao jin
2016-12-12 19:12   ` Alex Williamson
2016-12-12 22:29     ` Michael S. Tsirkin
2016-12-12 22:43       ` Alex Williamson
2016-12-13  3:15         ` Michael S. Tsirkin
2016-12-13  3:39           ` Alex Williamson
2016-12-13 16:12             ` Michael S. Tsirkin
2016-12-13 16:27               ` Alex Williamson
2016-12-14  1:58                 ` Michael S. Tsirkin
2016-12-14  3:00                   ` Alex Williamson
2016-12-14 22:20                     ` Michael S. Tsirkin
2016-12-14 22:47                       ` Alex Williamson
2016-12-14 23:00                         ` Michael S. Tsirkin
2016-12-14 23:32                           ` Alex Williamson
2016-12-14 10:24     ` Cao jin
2016-12-14 22:16       ` Alex Williamson
2016-12-14 22:25         ` Michael S. Tsirkin
2016-12-14 22:49           ` Alex Williamson [this message]
2016-12-15 13:56         ` Cao jin
2016-12-15 14:50           ` Michael S. Tsirkin
2016-12-15 22:01             ` Alex Williamson
2016-12-16 10:15               ` Cao jin
2016-12-16 10:15             ` Cao jin
2016-12-15 17:02           ` Alex Williamson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161214154923.5a4e7c90@t450s.home \
    --to=alex.williamson@redhat.com \
    --cc=caoj.fnst@cn.fujitsu.com \
    --cc=izumi.taku@jp.fujitsu.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).