qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Farhan Ali <alifm@linux.ibm.com>
To: "Cédric Le Goater" <clg@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-s390x@nongnu.org,
	mjrosato@linux.ibm.com, thuth@redhat.com,
	alex.williamson@redhat.com
Subject: Re: [PATCH v3 2/5] vfio/pci: Add an error handler callback
Date: Mon, 29 Sep 2025 10:20:38 -0700	[thread overview]
Message-ID: <0c221734-5faf-4829-bc17-21ec96a91fa5@linux.ibm.com> (raw)
In-Reply-To: <4207529b-a0a5-4360-8449-f4c20661e9e8@redhat.com>


On 9/27/2025 12:05 AM, Cédric Le Goater wrote:
> On 9/27/25 07:59, Markus Armbruster wrote:
>> Farhan Ali <alifm@linux.ibm.com> writes:
>>
>>> On 9/25/2025 9:57 PM, Markus Armbruster wrote:
>>>> Farhan Ali <alifm@linux.ibm.com> writes:
>>>>
>>>>> Provide a vfio error handling callback, that can be used by 
>>>>> devices to
>>>>> handle PCI errors for passthrough devices.
>>>>>
>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>> ---
>>>>>    hw/vfio/pci.c | 8 ++++++++
>>>>>    hw/vfio/pci.h | 1 +
>>>>>    2 files changed, 9 insertions(+)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index bc0b4c4d56..b02a974954 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -3063,11 +3063,19 @@ void vfio_pci_put_device(VFIOPCIDevice *vdev)
>>>>>   static void vfio_err_notifier_handler(void *opaque)
>>>>>   {
>>>>>       VFIOPCIDevice *vdev = opaque;
>>>>> +    Error *err = NULL;
>>>>>
>>>>>       if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>>>>           return;
>>>>>       }
>>>>>
>>>>> +    if (vdev->err_handler) {
>>>>> +        if (vdev->err_handler(vdev, &err)) {
>>>>> +            return;
>>>>> +        }
>>>>> +        error_report_err(err);
>>>>> +    }
>>>>
>>>> This is unusual.
>>>>
>>>> Functions taking an Error ** argument usually do so to report errors.
>>>> The rules spelled out in qapi/error.h apply.  In particular:
>>>>
>>>>    * - On success, the function should not touch *errp.  On 
>>>> failure, it
>>>>    *   should set a new error, e.g. with error_setg(errp, ...), or
>>>>    *   propagate an existing one, e.g. with error_propagate(errp, 
>>>> ...).
>>>>    *
>>>>    * - Whenever practical, also return a value that indicates 
>>>> success /
>>>>    *   failure.  This can make the error checking more concise, and 
>>>> can
>>>>    *   avoid useless error object creation and destruction. Note that
>>>>    *   we still have many functions returning void.  We recommend
>>>>    *   • bool-valued functions return true on success / false on 
>>>> failure,
>>>>
>>>> If ->err_handler() behaved that way, it @err would be null after it
>>>> returns false.  We'd call error_report_err(NULL), and crash.
>>>>
>>>> Functions with unusual behavior need a contract: a comment spelling 
>>>> out
>>>> their behavior.
>>>>
>>>> What is the intended behavior of the err_handler() callback?
>>>
>>> Hi Markus,
>>>
>>> Thanks for reviewing! The intended behavior for err_handler() is to 
>>> set errp and report the error on false/failure. With the above code, 
>>> I also intended fall through to vm_stop() when err_handler() fails.
>>>
>>> I think I misunderstood the errp error handling, it seems like the 
>>> correct way to do what I intended would be
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index b02a974954..630de46c90 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3070,10 +3070,11 @@ static void vfio_err_notifier_handler(void 
>>> *opaque)
>>>       }
>>>
>>>       if (vdev->err_handler) {
>>> -        if (vdev->err_handler(vdev, &err)) {
>>> +        if (!vdev->err_handler(vdev, &err)) {
>>> +            error_report_err(err);
>>> +        } else {
>>>               return;
>>>           }
>>> -        error_report_err(err);
>>>       }
>>>
>>> Please correct me if I missed anything.
>>
>> Resulting function:
>>
>>     static void vfio_err_notifier_handler(void *opaque)
>>     {
>>         VFIOPCIDevice *vdev = opaque;
>>         Error *err = NULL;
>>
>>         if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>             return;
>>         }
>>
>>         if (vdev->err_handler) {
>>             if (!vdev->err_handler(vdev, &err)) {
>>                 error_report_err(err);
>>             } else {
>>                 return;
>>             }
>>         }
>>
>>         /*
>>          * TBD. Retrieve the error details and decide what action
>>          * needs to be taken. One of the actions could be to pass
>>          * the error to the guest and have the guest driver recover
>>          * from the error. This requires that PCIe capabilities be
>>          * exposed to the guest. For now, we just terminate the
>>          * guest to contain the error.
>>          */
>>
>>         error_report("%s(%s) Unrecoverable error detected. Please 
>> collect any data possible and then kill the guest", __func__, 
>> vdev->vbasedev.name);
>>
>>         vm_stop(RUN_STATE_INTERNAL_ERROR);
>>     }
>>
>> Slighly rearranged for clearer control flow:
>>
>>     static void vfio_err_notifier_handler(void *opaque)
>>     {
>>         VFIOPCIDevice *vdev = opaque;
>>         Error *err = NULL;
>>
>>         if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>             return;
>>         }
>>
>>         if (vdev->err_handler) {
>>             if (vdev->err_handler(vdev, &err)) {
>>                 /* Error successfully handled */
>>                 return;
>>             }
>>             error_report_err(err);
>>         }

Yes, this is what i intended to do with my patch and provide a clearer 
flow. Though the compiler error reported by Cedric, is a little 
confusing, need to understand why that happens.


>>
>>         /*
>>          * TBD. Retrieve the error details and decide what action
>>          * needs to be taken. One of the actions could be to pass
>>          * the error to the guest and have the guest driver recover
>>          * from the error. This requires that PCIe capabilities be
>>          * exposed to the guest. For now, we just terminate the
>>          * guest to contain the error.
>>          */
>>
>>         error_report("%s(%s) Unrecoverable error detected. Please 
>> collect any data possible and then kill the guest", __func__, 
>> vdev->vbasedev.name);
>>
>>         vm_stop(RUN_STATE_INTERNAL_ERROR);
>>     }
>>
>> Questions / issues:
>>
>> * Is the comment still accurate?

This comment would still apply for vfio-pci devices on other 
architectures except for s390x. We are trying to change this behavior 
for s390x.

>>
>> * When ->err_handler() fails, we report the error twice. Would it make
>>    sense to combine the two error reports into one?
>
> Yes. It was my request too.
>
> Thanks,
>
> C.

I was a little hesitant about changing the existing error message as its 
been there for almost 12 years (since commit 7b4b0e9eda ("vfio: 
QEMU-AER: Qemu changes to support AER for VFIO-PCI devices")). Nothing 
should ever dependent on specific error messages, but still.. .If the 
preference is to combine/change the message I can do that.


>
>
>
>> * Preexisting: the second error message is ugly.
>>
>>    Error messages should be short and to the point: single phrase, with
>>    no newline or trailing punctuation.  The "please collect ..." part
>>    does not belong to the error message proper, it's advice on what to
>>    do.  Better: report the error, then print advice:
>>
>>         error_report("%s(%s) Unrecoverable error detected",
>>                      __func__, vdev->vbasedev.name);
>>         error_printf("Please collect any data possible and then kill 
>> the guest.");
>>
>>    Including __func__ in an error message is an anti-pattern. Look at
>>
>>      vfio_err_notifier_handler(fred) Unrecoverable error detected
>>
>>    with a user's eyes: "vfio_err_notifier_handler" is programmer
>>    gobbledygook, the device name "fred" is useful once you realize what
>>    it is, "Unrecoverable error detected" lacks detail.
>>
>> [...]
>>
>
How about "(device) Unrecoverable PCIe error detected for device"

Thanks
Farhan




  reply	other threads:[~2025-09-29 17:22 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-25 17:48 [PATCH v3 0/5] Error recovery for zPCI passthrough devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 1/5] [NOTFORMERGE] linux-headers: Update for zpci vfio device Farhan Ali
2025-09-25 17:48 ` [PATCH v3 2/5] vfio/pci: Add an error handler callback Farhan Ali
2025-09-26  4:57   ` Markus Armbruster
2025-09-26  7:40     ` Cédric Le Goater
2025-09-26 18:44       ` Farhan Ali
2025-09-26 17:53     ` Farhan Ali
2025-09-27  5:59       ` Markus Armbruster
2025-09-27  7:05         ` Cédric Le Goater
2025-09-29 17:20           ` Farhan Ali [this message]
2025-09-30  9:20             ` Markus Armbruster
2025-09-30 17:15               ` Farhan Ali
2025-10-01  4:52                 ` Markus Armbruster
2025-10-01 18:21                   ` Farhan Ali
2025-10-06  6:06                     ` Markus Armbruster
2025-09-25 17:48 ` [PATCH v3 3/5] vfio: Introduce vfio_device_feature helper function Farhan Ali
2025-09-25 17:48 ` [PATCH v3 4/5] s390x/pci: Add PCI error handling for vfio pci devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 5/5] s390x/pci: Reset a device in error state Farhan Ali

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0c221734-5faf-4829-bc17-21ec96a91fa5@linux.ibm.com \
    --to=alifm@linux.ibm.com \
    --cc=alex.williamson@redhat.com \
    --cc=armbru@redhat.com \
    --cc=clg@redhat.com \
    --cc=mjrosato@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-s390x@nongnu.org \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).