From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:35467)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chen.fan.fnst@cn.fujitsu.com>) id 1aqVmk-0008VO-5W
	for qemu-devel@nongnu.org; Wed, 13 Apr 2016 21:09:31 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <chen.fan.fnst@cn.fujitsu.com>) id 1aqVmf-00022T-EB
	for qemu-devel@nongnu.org; Wed, 13 Apr 2016 21:09:30 -0400
Received: from [59.151.112.132] (port=8553 helo=heian.cn.fujitsu.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chen.fan.fnst@cn.fujitsu.com>) id 1aqVma-00020q-1t
	for qemu-devel@nongnu.org; Wed, 13 Apr 2016 21:09:25 -0400
Message-ID: <570EEC42.3040300@cn.fujitsu.com>
Date: Thu, 14 Apr 2016 09:02:58 +0800
From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
MIME-Version: 1.0
References: <1459856523-17085-1-git-send-email-caoj.fnst@cn.fujitsu.com>	<1459856523-17085-12-git-send-email-caoj.fnst@cn.fujitsu.com>
	<20160411153827.3884ded1@t450s.home>
In-Reply-To: <20160411153827.3884ded1@t450s.home>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [patch v6 11/12] vfio: register aer resume
 notification handler for aer resume
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>, Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, izumi.taku@jp.fujitsu.com


On 04/12/2016 05:38 AM, Alex Williamson wrote:
> On Tue, 5 Apr 2016 19:42:02 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>
>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>
>> for supporting aer recovery, host and guest would run the same aer
>> recovery code, that would do the secondary bus reset if the error
>> is fatal, the aer recovery process:
>>    1. error_detected
>>    2. reset_link (if fatal)
>>    3. slot_reset/mmio_enabled
>>    4. resume
>>
>> it indicates that host will do secondary bus reset to reset
>> the physical devices under bus in step 2, that would cause
>> devices in D3 status in a short time. but in qemu, we register
>> an error detected handler, that would be invoked as host broadcasts
>> the error-detected event in step 1, in order to avoid guest do
>> reset_link when host do reset_link simultaneously. it may cause
>> fatal error. we introduce a resmue notifier to assure host reset
>> completely. then do guest aer injection.
> Why is it safe to continue running the VM between the error detected
> notification and the resume notification?  We're just pushing back the
> point at which we inject the AER into the guest, potentially negating
> any benefit by allowing the VM to consume bad data.  Shouldn't we
> instead be immediately notifying the VM on error detected, but stalling
> any access to the device until resume is signaled?  How do we know that
> resume will ever be signaled?  We have both the problem that we may be
> running on an older kernel that won't support a resume notification and
> the problem that seeing a resume notification depends on the host being
> able to successfully complete a link reset after fatal error.  We can
> detect support for resume notification, but we still need a strategy
> for never receiving it.  Thanks,
That's make sense, but I haven't came up with a good idea. do you have
any idea, Alex?

Thanks,
Chen


>
> Alex
>
>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>> ---
>>   hw/vfio/pci.c              | 157 +++++++++++++++++++++++++++++++++++----------
>>   hw/vfio/pci.h              |   2 +
>>   linux-headers/linux/vfio.h |   1 +
>>   3 files changed, 126 insertions(+), 34 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 691ff5e..d79fb3d 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2610,12 +2610,7 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>>   static void vfio_err_notifier_handler(void *opaque)
>>   {
>>       VFIOPCIDevice *vdev = opaque;
>> -    PCIDevice *dev = &vdev->pdev;
>>       Error *local_err = NULL;
>> -    PCIEAERMsg msg = {
>> -        .severity = 0,
>> -        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
>> -    };
>>   
>>       if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>           return;
>> @@ -2636,35 +2631,7 @@ static void vfio_err_notifier_handler(void *opaque)
>>           goto stop;
>>       }
>>   
>> -    /*
>> -     * we should read the error details from the real hardware
>> -     * configuration spaces, here we only need to do is signaling
>> -     * to guest an uncorrectable error has occurred.
>> -     */
>> -    if (dev->exp.aer_cap) {
>> -        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
>> -        uint32_t uncor_status;
>> -        bool isfatal;
>> -
>> -        uncor_status = vfio_pci_read_config(dev,
>> -                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
>> -
>> -        /*
>> -         * if the error is not emitted by this device, we can
>> -         * just ignore it.
>> -         */
>> -        if (!(uncor_status & ~0UL)) {
>> -            return;
>> -        }
>> -
>> -        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
>> -
>> -        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>> -                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
>> -
>> -        pcie_aer_msg(dev, &msg);
>> -        return;
>> -    }
>> +    return;
>>   
>>   stop:
>>       /*
>> @@ -2757,6 +2724,126 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>>       event_notifier_cleanup(&vdev->err_notifier);
>>   }
>>   
>> +static void vfio_resume_notifier_handler(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *dev = &vdev->pdev;
>> +    PCIEAERMsg msg = {
>> +        .severity = 0,
>> +        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
>> +    };
>> +
>> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * we should read the error details from the real hardware
>> +     * configuration spaces, here we only need to do is signaling
>> +     * to guest an uncorrectable error has occurred.
>> +     */
>> +    if (dev->exp.aer_cap) {
>> +        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
>> +        uint32_t uncor_status;
>> +        bool isfatal;
>> +
>> +        uncor_status = vfio_pci_read_config(dev,
>> +                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
>> +
>> +        /*
>> +         * if the error is not emitted by this device, we can
>> +         * just ignore it.
>> +         */
>> +        if (!(uncor_status & ~0UL)) {
>> +            return;
>> +        }
>> +
>> +        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
>> +
>> +        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>> +                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
>> +
>> +        pcie_aer_msg(dev, &msg);
>> +    }
>> +}
>> +
>> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
>> +{
>> +    int ret;
>> +    int argsz;
>> +    struct vfio_irq_set *irq_set;
>> +    int32_t *pfd;
>> +
>> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
>> +        return;
>> +    }
>> +
>> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
>> +        error_report("vfio: Unable to init event notifier for"
>> +                     " resume notification");
>> +        return;
>> +    }
>> +
>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>> +
>> +    irq_set = g_malloc0(argsz);
>> +    irq_set->argsz = argsz;
>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>> +    irq_set->start = 0;
>> +    irq_set->count = 1;
>> +    pfd = (int32_t *)&irq_set->data;
>> +
>> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
>> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
>> +
>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    if (ret) {
>> +        error_report("vfio: Failed to set up resume notification");
>> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
>> +        event_notifier_cleanup(&vdev->resume_notifier);
>> +    } else {
>> +        vdev->resume_enabled = true;
>> +    }
>> +    g_free(irq_set);
>> +}
>> +
>> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
>> +{
>> +    int argsz;
>> +    struct vfio_irq_set *irq_set;
>> +    int32_t *pfd;
>> +    int ret;
>> +
>> +    if (!vdev->resume_enabled) {
>> +        return;
>> +    }
>> +
>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>> +
>> +    irq_set = g_malloc0(argsz);
>> +    irq_set->argsz = argsz;
>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>> +    irq_set->start = 0;
>> +    irq_set->count = 1;
>> +    pfd = (int32_t *)&irq_set->data;
>> +    *pfd = -1;
>> +
>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    if (ret) {
>> +        error_report("vfio: Failed to de-assign error fd: %m");
>> +    }
>> +    g_free(irq_set);
>> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
>> +                        NULL, NULL, vdev);
>> +    event_notifier_cleanup(&vdev->resume_notifier);
>> +
>> +    vdev->resume_enabled = false;
>> +}
>> +
>>   static void vfio_req_notifier_handler(void *opaque)
>>   {
>>       VFIOPCIDevice *vdev = opaque;
>> @@ -3062,6 +3149,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>       }
>>   
>>       vfio_register_err_notifier(vdev);
>> +    vfio_register_aer_resume_notifier(vdev);
>>       vfio_register_req_notifier(vdev);
>>       vfio_setup_resetfn_quirk(vdev);
>>   
>> @@ -3092,6 +3180,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>>       VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>   
>>       vfio_unregister_req_notifier(vdev);
>> +    vfio_unregister_aer_resume_notifier(vdev);
>>       vfio_unregister_err_notifier(vdev);
>>       pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>>       vfio_disable_interrupts(vdev);
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 9fb0206..3ebc58f 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
>>       VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>>       PCIHostDeviceAddress host;
>>       EventNotifier err_notifier;
>> +    EventNotifier resume_notifier;
>>       EventNotifier req_notifier;
>>       int (*resetfn)(struct VFIOPCIDevice *);
>>       uint32_t vendor_id;
>> @@ -144,6 +145,7 @@ typedef struct VFIOPCIDevice {
>>       bool no_kvm_msi;
>>       bool no_kvm_msix;
>>       bool single_depend_dev;
>> +    bool resume_enabled;
>>   } VFIOPCIDevice;
>>   
>>   uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 15e096c..6d1826d 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -345,6 +345,7 @@ enum {
>>   	VFIO_PCI_MSIX_IRQ_INDEX,
>>   	VFIO_PCI_ERR_IRQ_INDEX,
>>   	VFIO_PCI_REQ_IRQ_INDEX,
>> +	VFIO_PCI_RESUME_IRQ_INDEX,
>>   	VFIO_PCI_NUM_IRQS
>>   };
>>   
>
>
> .
>