From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:47377) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aloMt-00036m-ES for qemu-devel@nongnu.org; Thu, 31 Mar 2016 21:59:25 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aloMq-0004rT-4n for qemu-devel@nongnu.org; Thu, 31 Mar 2016 21:59:23 -0400 Received: from [59.151.112.132] (port=16177 helo=heian.cn.fujitsu.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aloMo-0004r2-5r for qemu-devel@nongnu.org; Thu, 31 Mar 2016 21:59:20 -0400 Message-ID: <56FDD4A4.2040703@cn.fujitsu.com> Date: Fri, 1 Apr 2016 09:53:40 +0800 From: Chen Fan MIME-Version: 1.0 References: <1458727927-15082-1-git-send-email-caoj.fnst@cn.fujitsu.com> <1458727927-15082-12-git-send-email-caoj.fnst@cn.fujitsu.com> <20160324165428.68bbcc96@t450s.home> <56F49681.8040101@cn.fujitsu.com> <20160324202255.4a397ca8@ul30vt.home> <56FCC9CB.9060802@cn.fujitsu.com> <20160331094404.4126791b@t450s.home> <56FDD17B.1090304@cn.fujitsu.com> In-Reply-To: <56FDD17B.1090304@cn.fujitsu.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [patch v5 11/12] vfio: device may stuck in D3 when doing aer recovery List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: izumi.taku@jp.fujitsu.com, Cao jin , qemu-devel@nongnu.org, mst@redhat.com On 04/01/2016 09:40 AM, Chen Fan wrote: > > On 03/31/2016 11:44 PM, Alex Williamson wrote: >> On Thu, 31 Mar 2016 14:55:07 +0800 >> Chen Fan wrote: >> >>> On 03/25/2016 10:22 AM, Alex Williamson wrote: >>>> On Fri, 25 Mar 2016 09:38:09 +0800 >>>> Chen Fan wrote: >>>>> On 03/25/2016 06:54 AM, Alex Williamson wrote: >>>>>> On Wed, 23 Mar 2016 18:12:06 +0800 >>>>>> Cao jin wrote: >>>>>>> From: Chen Fan >>>>>>> >>>>>>> when a physical device aer occurred, the device state probably >>>>>>> is not in D0 in a short time, if we recover the device quickly. >>>>>>> we may stuck in D3 state when force to change device state to D0. >>>>>>> we may need to wait for a short time to inject the error to guest. >>>>>>> >>>>>>> Signed-off-by: Chen Fan >>>>>>> --- >>>>>>> hw/vfio/pci.c | 3 +++ >>>>>>> 1 file changed, 3 insertions(+) >>>>>>> >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>>> index 25fc095..5216e7f 100644 >>>>>>> --- a/hw/vfio/pci.c >>>>>>> +++ b/hw/vfio/pci.c >>>>>>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void >>>>>>> *opaque) >>>>>>> msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN : >>>>>>> PCI_ERR_ROOT_CMD_NONFATAL_EN; >>>>>>> + /* wait a bit to ensure aer device is ready */ >>>>>>> + usleep(2 * 1000); >>>>>> Where does this number come from? Why would the device be in D3? I >>>>>> don't understand this at all. >>>>> Hi Alex, >>>>> >>>>> when I tested the code in my environment, I found that when >>>>> I used >>>>> the aer-inject module to inject a fake aer error to device on >>>>> host, the qemu >>>>> would throw out the message "vfio: Unable to power on device, >>>>> stuck in D3" >>>>> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the >>>>> phenomenon >>>>> would not appearance, I just thought it should be some timing race >>>>> issue, >>>>> so I use a sleep() to wait 2ms (double the reset time of 1ms) to >>>>> ensure the >>>>> device state is ready. maybe the root reason still need to be >>>>> investigated deeply. >>>> Yes, it sounds like you need to investigate this further, the delay is >>>> arbitrary and perhaps suggests a race that needs to be fixed >>>> correctly. Thanks, >>> Hi Alex, >>> >>> after done some investigation of the problem, I found that only >>> when the injected >>> error is fatal, the problem will appear. because in aer do_recovery, >>> host will call reset_link >>> on the root port, which would invoke pci_reset_bridge_secondary_bus in >>> aer_root_reset, >>> that would reset the bridge and all the device under that. so when qemu >>> receive the aer >>> notification, then propagate the error to guest, guest does the same >>> way >>> to perform the >>> recovery, if the guest `reset_link` that will call the vfio_pre_reset >>> done at the stage of host >>> bridge reset, the device status would probable stick in D3. >>> >>> so I think after qemu receive the aer notification, we should wait for >>> enough time to >>> ensure the bridge has been reset completely. I just use sleep <=10ms to >>> test the code, >>> seems still appear the message "vfio: Unable to power on device, stuck >>> in D3". so I think >>> we should sleep 100ms to ensure the delay sufficient. I have tested >>> that >>> code 100+ times >>> by inject aer error. the issue no longer appears. >> I'm not satisfied with this. pci_reset_bridge_secondary_bus() is >> invoked by both the host AER code and the guest AER code, the latter >> via the vfio PCI hot reset interface. The >> pci_reset_bridge_secondary_bus() function includes the spec defined >> delay by which point all the devices should be operational again. The >> spec also defines that devices are in D0 after reset, which implies >> that the only reason we would ever be seeing a device in D3 is if we're >> reading the device while it is still in reset or before it has >> recovered from reset. That implies that either >> pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is >> allowing one device to call vfio_pre_reset while another device is >> still in reset. I suspect QEMU serializes reset such that the latter >> case is not possible, which means that you might have a device that >> takes longer to reset than the spec defines. Such a quirk should be >> handled in the host kernel reset, not by adding arbitrary delays in >> userspace code. Thanks, > maybe it is not the delay issue actually. let me analyze the aer process > in do_recovery: > > host qemu guest > > broadcast: error_detected > +--> error_detected (VFIO) > +--> eventfd_signal -----> inject_aer ---------> > broadcast: error_detected > *reset_link* +--> error_detected (real driver) > *reset_link* > broadcast: mmio/slot_reset > broadcast: mmio/slot_reset > broadcast: > resume broadcast: resume apologies to all, the above diagram was chaotic. I redraw it. host qemu guest broadcast: error_detected (host) +--> error_detected (VFIO) +--> eventfd_signal -----> inject_aer -----> broadcast: error_detected (gust) *reset_link* (host) +--> error_detected (real driver) (gust) *reset_link* (gust) broadcast: mmio/slot_reset (host) broadcast: mmio/slot_reset (gust) broadcast: resume (host) broadcast: resume (gust) Thanks, Chen > > we can see that the reset_link in host and in guest are not called in > order. the reset_link > in guest may execute during the host broadcast "error_detected" or > "reset_link" as long > as there are many devices need to walk on the bus. so in the case qemu > has the opportunity > to call vfio_pre_reset while host is still in reset. > > so can we consider moving the eventfd_signal in error_detected to > resume/slot_reset callback > in vfio driver, then we can assure that the host reset_link finish > before guest do reset_link. > or adding a new resume/slot_reset event between vfio driver and qemu > to delay the guest reset_link. > > Thanks, > Chen > > > > > > > > > > > >> >> Alex >> >> >> . >> > > > > > . >