From: Farhan Ali <alifm@linux.ibm.com>
To: "Cédric Le Goater" <clg@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-s390x@nongnu.org,
mjrosato@linux.ibm.com, thuth@redhat.com,
alex.williamson@redhat.com
Subject: Re: [PATCH v3 2/5] vfio/pci: Add an error handler callback
Date: Mon, 29 Sep 2025 10:20:38 -0700 [thread overview]
Message-ID: <0c221734-5faf-4829-bc17-21ec96a91fa5@linux.ibm.com> (raw)
In-Reply-To: <4207529b-a0a5-4360-8449-f4c20661e9e8@redhat.com>
On 9/27/2025 12:05 AM, Cédric Le Goater wrote:
> On 9/27/25 07:59, Markus Armbruster wrote:
>> Farhan Ali <alifm@linux.ibm.com> writes:
>>
>>> On 9/25/2025 9:57 PM, Markus Armbruster wrote:
>>>> Farhan Ali <alifm@linux.ibm.com> writes:
>>>>
>>>>> Provide a vfio error handling callback, that can be used by
>>>>> devices to
>>>>> handle PCI errors for passthrough devices.
>>>>>
>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>> ---
>>>>> hw/vfio/pci.c | 8 ++++++++
>>>>> hw/vfio/pci.h | 1 +
>>>>> 2 files changed, 9 insertions(+)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index bc0b4c4d56..b02a974954 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -3063,11 +3063,19 @@ void vfio_pci_put_device(VFIOPCIDevice *vdev)
>>>>> static void vfio_err_notifier_handler(void *opaque)
>>>>> {
>>>>> VFIOPCIDevice *vdev = opaque;
>>>>> + Error *err = NULL;
>>>>>
>>>>> if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>>>> return;
>>>>> }
>>>>>
>>>>> + if (vdev->err_handler) {
>>>>> + if (vdev->err_handler(vdev, &err)) {
>>>>> + return;
>>>>> + }
>>>>> + error_report_err(err);
>>>>> + }
>>>>
>>>> This is unusual.
>>>>
>>>> Functions taking an Error ** argument usually do so to report errors.
>>>> The rules spelled out in qapi/error.h apply. In particular:
>>>>
>>>> * - On success, the function should not touch *errp. On
>>>> failure, it
>>>> * should set a new error, e.g. with error_setg(errp, ...), or
>>>> * propagate an existing one, e.g. with error_propagate(errp,
>>>> ...).
>>>> *
>>>> * - Whenever practical, also return a value that indicates
>>>> success /
>>>> * failure. This can make the error checking more concise, and
>>>> can
>>>> * avoid useless error object creation and destruction. Note that
>>>> * we still have many functions returning void. We recommend
>>>> * • bool-valued functions return true on success / false on
>>>> failure,
>>>>
>>>> If ->err_handler() behaved that way, it @err would be null after it
>>>> returns false. We'd call error_report_err(NULL), and crash.
>>>>
>>>> Functions with unusual behavior need a contract: a comment spelling
>>>> out
>>>> their behavior.
>>>>
>>>> What is the intended behavior of the err_handler() callback?
>>>
>>> Hi Markus,
>>>
>>> Thanks for reviewing! The intended behavior for err_handler() is to
>>> set errp and report the error on false/failure. With the above code,
>>> I also intended fall through to vm_stop() when err_handler() fails.
>>>
>>> I think I misunderstood the errp error handling, it seems like the
>>> correct way to do what I intended would be
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index b02a974954..630de46c90 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3070,10 +3070,11 @@ static void vfio_err_notifier_handler(void
>>> *opaque)
>>> }
>>>
>>> if (vdev->err_handler) {
>>> - if (vdev->err_handler(vdev, &err)) {
>>> + if (!vdev->err_handler(vdev, &err)) {
>>> + error_report_err(err);
>>> + } else {
>>> return;
>>> }
>>> - error_report_err(err);
>>> }
>>>
>>> Please correct me if I missed anything.
>>
>> Resulting function:
>>
>> static void vfio_err_notifier_handler(void *opaque)
>> {
>> VFIOPCIDevice *vdev = opaque;
>> Error *err = NULL;
>>
>> if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>> return;
>> }
>>
>> if (vdev->err_handler) {
>> if (!vdev->err_handler(vdev, &err)) {
>> error_report_err(err);
>> } else {
>> return;
>> }
>> }
>>
>> /*
>> * TBD. Retrieve the error details and decide what action
>> * needs to be taken. One of the actions could be to pass
>> * the error to the guest and have the guest driver recover
>> * from the error. This requires that PCIe capabilities be
>> * exposed to the guest. For now, we just terminate the
>> * guest to contain the error.
>> */
>>
>> error_report("%s(%s) Unrecoverable error detected. Please
>> collect any data possible and then kill the guest", __func__,
>> vdev->vbasedev.name);
>>
>> vm_stop(RUN_STATE_INTERNAL_ERROR);
>> }
>>
>> Slighly rearranged for clearer control flow:
>>
>> static void vfio_err_notifier_handler(void *opaque)
>> {
>> VFIOPCIDevice *vdev = opaque;
>> Error *err = NULL;
>>
>> if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>> return;
>> }
>>
>> if (vdev->err_handler) {
>> if (vdev->err_handler(vdev, &err)) {
>> /* Error successfully handled */
>> return;
>> }
>> error_report_err(err);
>> }
Yes, this is what i intended to do with my patch and provide a clearer
flow. Though the compiler error reported by Cedric, is a little
confusing, need to understand why that happens.
>>
>> /*
>> * TBD. Retrieve the error details and decide what action
>> * needs to be taken. One of the actions could be to pass
>> * the error to the guest and have the guest driver recover
>> * from the error. This requires that PCIe capabilities be
>> * exposed to the guest. For now, we just terminate the
>> * guest to contain the error.
>> */
>>
>> error_report("%s(%s) Unrecoverable error detected. Please
>> collect any data possible and then kill the guest", __func__,
>> vdev->vbasedev.name);
>>
>> vm_stop(RUN_STATE_INTERNAL_ERROR);
>> }
>>
>> Questions / issues:
>>
>> * Is the comment still accurate?
This comment would still apply for vfio-pci devices on other
architectures except for s390x. We are trying to change this behavior
for s390x.
>>
>> * When ->err_handler() fails, we report the error twice. Would it make
>> sense to combine the two error reports into one?
>
> Yes. It was my request too.
>
> Thanks,
>
> C.
I was a little hesitant about changing the existing error message as its
been there for almost 12 years (since commit 7b4b0e9eda ("vfio:
QEMU-AER: Qemu changes to support AER for VFIO-PCI devices")). Nothing
should ever dependent on specific error messages, but still.. .If the
preference is to combine/change the message I can do that.
>
>
>
>> * Preexisting: the second error message is ugly.
>>
>> Error messages should be short and to the point: single phrase, with
>> no newline or trailing punctuation. The "please collect ..." part
>> does not belong to the error message proper, it's advice on what to
>> do. Better: report the error, then print advice:
>>
>> error_report("%s(%s) Unrecoverable error detected",
>> __func__, vdev->vbasedev.name);
>> error_printf("Please collect any data possible and then kill
>> the guest.");
>>
>> Including __func__ in an error message is an anti-pattern. Look at
>>
>> vfio_err_notifier_handler(fred) Unrecoverable error detected
>>
>> with a user's eyes: "vfio_err_notifier_handler" is programmer
>> gobbledygook, the device name "fred" is useful once you realize what
>> it is, "Unrecoverable error detected" lacks detail.
>>
>> [...]
>>
>
How about "(device) Unrecoverable PCIe error detected for device"
Thanks
Farhan
next prev parent reply other threads:[~2025-09-29 17:22 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-25 17:48 [PATCH v3 0/5] Error recovery for zPCI passthrough devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 1/5] [NOTFORMERGE] linux-headers: Update for zpci vfio device Farhan Ali
2025-09-25 17:48 ` [PATCH v3 2/5] vfio/pci: Add an error handler callback Farhan Ali
2025-09-26 4:57 ` Markus Armbruster
2025-09-26 7:40 ` Cédric Le Goater
2025-09-26 18:44 ` Farhan Ali
2025-09-26 17:53 ` Farhan Ali
2025-09-27 5:59 ` Markus Armbruster
2025-09-27 7:05 ` Cédric Le Goater
2025-09-29 17:20 ` Farhan Ali [this message]
2025-09-30 9:20 ` Markus Armbruster
2025-09-30 17:15 ` Farhan Ali
2025-10-01 4:52 ` Markus Armbruster
2025-10-01 18:21 ` Farhan Ali
2025-10-06 6:06 ` Markus Armbruster
2025-09-25 17:48 ` [PATCH v3 3/5] vfio: Introduce vfio_device_feature helper function Farhan Ali
2025-09-25 17:48 ` [PATCH v3 4/5] s390x/pci: Add PCI error handling for vfio pci devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 5/5] s390x/pci: Reset a device in error state Farhan Ali
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0c221734-5faf-4829-bc17-21ec96a91fa5@linux.ibm.com \
--to=alifm@linux.ibm.com \
--cc=alex.williamson@redhat.com \
--cc=armbru@redhat.com \
--cc=clg@redhat.com \
--cc=mjrosato@linux.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-s390x@nongnu.org \
--cc=thuth@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).