From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:45338)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chen.fan.fnst@cn.fujitsu.com>) id 1alo9n-0006xO-8a
	for qemu-devel@nongnu.org; Thu, 31 Mar 2016 21:45:52 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <chen.fan.fnst@cn.fujitsu.com>) id 1alo9j-00024F-6n
	for qemu-devel@nongnu.org; Thu, 31 Mar 2016 21:45:51 -0400
Received: from [59.151.112.132] (port=37001 helo=heian.cn.fujitsu.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chen.fan.fnst@cn.fujitsu.com>) id 1alo9h-000249-Sc
	for qemu-devel@nongnu.org; Thu, 31 Mar 2016 21:45:47 -0400
Message-ID: <56FDD17B.1090304@cn.fujitsu.com>
Date: Fri, 1 Apr 2016 09:40:11 +0800
From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
MIME-Version: 1.0
References: <1458727927-15082-1-git-send-email-caoj.fnst@cn.fujitsu.com>	<1458727927-15082-12-git-send-email-caoj.fnst@cn.fujitsu.com>	<20160324165428.68bbcc96@t450s.home>	<56F49681.8040101@cn.fujitsu.com>	<20160324202255.4a397ca8@ul30vt.home>	<56FCC9CB.9060802@cn.fujitsu.com>
	<20160331094404.4126791b@t450s.home>
In-Reply-To: <20160331094404.4126791b@t450s.home>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [patch v5 11/12] vfio: device may stuck in D3 when
 doing aer recovery
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: izumi.taku@jp.fujitsu.com, Cao jin <caoj.fnst@cn.fujitsu.com>, qemu-devel@nongnu.org, mst@redhat.com


On 03/31/2016 11:44 PM, Alex Williamson wrote:
> On Thu, 31 Mar 2016 14:55:07 +0800
> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>
>> On 03/25/2016 10:22 AM, Alex Williamson wrote:
>>> On Fri, 25 Mar 2016 09:38:09 +0800
>>> Chen Fan <chen.fan.fnst@cn.fujitsu.com> wrote:
>>>   
>>>> On 03/25/2016 06:54 AM, Alex Williamson wrote:
>>>>> On Wed, 23 Mar 2016 18:12:06 +0800
>>>>> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>>>>      
>>>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>>
>>>>>> when a physical device aer occurred, the device state probably
>>>>>> is not in D0 in a short time, if we recover the device quickly.
>>>>>> we may stuck in D3 state when force to change device state to D0.
>>>>>> we may need to wait for a short time to inject the error to guest.
>>>>>>
>>>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>> ---
>>>>>>     hw/vfio/pci.c | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index 25fc095..5216e7f 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
>>>>>>             msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>>>>>                                      PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>>>>>     
>>>>>> +        /* wait a bit to ensure aer device is ready */
>>>>>> +        usleep(2 * 1000);
>>>>> Where does this number come from?  Why would the device be in D3?  I
>>>>> don't understand this at all.
>>>> Hi Alex,
>>>>
>>>>        when I tested the code in my environment, I found that when I used
>>>> the aer-inject module to inject a fake aer error to device on host, the qemu
>>>> would throw out the message "vfio: Unable to power on device, stuck in D3"
>>>> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
>>>> would not appearance, I just thought it should be some timing race issue,
>>>> so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
>>>> device state is ready. maybe the root reason still need to be
>>>> investigated deeply.
>>> Yes, it sounds like you need to investigate this further, the delay is
>>> arbitrary and perhaps suggests a race that needs to be fixed
>>> correctly.  Thanks,
>> Hi Alex,
>>
>>       after done some investigation of the problem, I found that only
>> when the injected
>>    error is fatal, the problem will appear. because in aer do_recovery,
>> host will call reset_link
>> on the root port, which would invoke pci_reset_bridge_secondary_bus in
>> aer_root_reset,
>> that would reset the bridge and all the device under that. so when qemu
>> receive the aer
>> notification, then propagate the error to guest, guest does the same way
>> to perform the
>> recovery, if the guest `reset_link` that will call the vfio_pre_reset
>> done at the stage of host
>> bridge reset, the device status would probable stick in D3.
>>
>> so I think after qemu receive the aer notification, we should wait for
>> enough time to
>> ensure the bridge has been reset completely. I just use sleep <=10ms to
>> test the code,
>> seems still appear the message "vfio: Unable to power on device, stuck
>> in D3". so I think
>> we should sleep 100ms to ensure the delay sufficient. I have tested that
>> code 100+ times
>> by inject aer error. the issue no longer appears.
> I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
> invoked by both the host AER code and the guest AER code, the latter
> via the vfio PCI hot reset interface.  The
> pci_reset_bridge_secondary_bus() function includes the spec defined
> delay by which point all the devices should be operational again.  The
> spec also defines that devices are in D0 after reset, which implies
> that the only reason we would ever be seeing a device in D3 is if we're
> reading the device while it is still in reset or before it has
> recovered from reset.  That implies that either
> pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
> allowing one device to call vfio_pre_reset while another device is
> still in reset.  I suspect QEMU serializes reset such that the latter
> case is not possible, which means that you might have a device that
> takes longer to reset than the spec defines.  Such a quirk should be
> handled in the host kernel reset, not by adding arbitrary delays in
> userspace code.  Thanks,
maybe it is not the delay issue actually. let me analyze the aer process
in do_recovery:

                host                                      qemu          
      guest

         broadcast: error_detected
         +--> error_detected (VFIO)
           +--> eventfd_signal   ----->    inject_aer ---------> 
broadcast: error_detected
*reset_link* +--> error_detected (real driver)
*reset_link*
         broadcast: mmio/slot_reset                                     
broadcast: mmio/slot_reset
         broadcast: 
resume                                                  broadcast: resume

we can see that the reset_link in host and in guest are not called in 
order. the reset_link
in guest may execute during the host broadcast "error_detected" or 
"reset_link" as long
as there are many devices need to walk on the bus. so in the case qemu 
has the opportunity
to call vfio_pre_reset while host is still in reset.

so can we consider moving the eventfd_signal in error_detected to 
resume/slot_reset callback
in vfio driver, then we can assure that the host reset_link finish 
before guest do reset_link.
or adding a new resume/slot_reset event between vfio driver and qemu to 
delay the guest reset_link.

Thanks,
Chen


>
> Alex
>
>
> .
>