From: Markus Armbruster <armbru@redhat.com>
To: Farhan Ali <alifm@linux.ibm.com>
Cc: "Cédric Le Goater" <clg@redhat.com>,
qemu-devel@nongnu.org, qemu-s390x@nongnu.org,
mjrosato@linux.ibm.com, thuth@redhat.com,
alex.williamson@redhat.com
Subject: Re: [PATCH v3 2/5] vfio/pci: Add an error handler callback
Date: Tue, 30 Sep 2025 11:20:47 +0200 [thread overview]
Message-ID: <87qzvo9tzk.fsf@pond.sub.org> (raw)
In-Reply-To: <0c221734-5faf-4829-bc17-21ec96a91fa5@linux.ibm.com> (Farhan Ali's message of "Mon, 29 Sep 2025 10:20:38 -0700")
Farhan Ali <alifm@linux.ibm.com> writes:
> On 9/27/2025 12:05 AM, Cédric Le Goater wrote:
>> On 9/27/25 07:59, Markus Armbruster wrote:
>>> Farhan Ali <alifm@linux.ibm.com> writes:
>>>
>>>> On 9/25/2025 9:57 PM, Markus Armbruster wrote:
>>>>> Farhan Ali <alifm@linux.ibm.com> writes:
>>>>>
>>>>>> Provide a vfio error handling callback, that can be used by devices to
>>>>>> handle PCI errors for passthrough devices.
>>>>>>
>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>> ---
>>>>>> hw/vfio/pci.c | 8 ++++++++
>>>>>> hw/vfio/pci.h | 1 +
>>>>>> 2 files changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index bc0b4c4d56..b02a974954 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -3063,11 +3063,19 @@ void vfio_pci_put_device(VFIOPCIDevice *vdev)
>>>>>> static void vfio_err_notifier_handler(void *opaque)
>>>>>> {
>>>>>> VFIOPCIDevice *vdev = opaque;
>>>>>> + Error *err = NULL;
>>>>>>
>>>>>> if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>>>>> return;
>>>>>> }
>>>>>>
>>>>>> + if (vdev->err_handler) {
>>>>>> + if (vdev->err_handler(vdev, &err)) {
>>>>>> + return;
>>>>>> + }
>>>>>> + error_report_err(err);
>>>>>> + }
>>>>>
>>>>> This is unusual.
>>>>>
>>>>> Functions taking an Error ** argument usually do so to report errors.
>>>>> The rules spelled out in qapi/error.h apply. In particular:
>>>>>
>>>>> * - On success, the function should not touch *errp. On failure, it
>>>>> * should set a new error, e.g. with error_setg(errp, ...), or
>>>>> * propagate an existing one, e.g. with error_propagate(errp, ...).
>>>>> *
>>>>> * - Whenever practical, also return a value that indicates success /
>>>>> * failure. This can make the error checking more concise, and can
>>>>> * avoid useless error object creation and destruction. Note that
>>>>> * we still have many functions returning void. We recommend
>>>>> * • bool-valued functions return true on success / false on failure,
>>>>>
>>>>> If ->err_handler() behaved that way, it @err would be null after it
>>>>> returns false. We'd call error_report_err(NULL), and crash.
>>>>>
>>>>> Functions with unusual behavior need a contract: a comment spelling out
>>>>> their behavior.
>>>>>
>>>>> What is the intended behavior of the err_handler() callback?
>>>>
>>>> Hi Markus,
>>>>
>>>> Thanks for reviewing! The intended behavior for err_handler() is to set errp and report the error on false/failure. With the above code, I also intended fall through to vm_stop() when err_handler() fails.
>>>>
>>>> I think I misunderstood the errp error handling, it seems like the correct way to do what I intended would be
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index b02a974954..630de46c90 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3070,10 +3070,11 @@ static void vfio_err_notifier_handler(void *opaque)
>>>> }
>>>>
>>>> if (vdev->err_handler) {
>>>> - if (vdev->err_handler(vdev, &err)) {
>>>> + if (!vdev->err_handler(vdev, &err)) {
>>>> + error_report_err(err);
>>>> + } else {
>>>> return;
>>>> }
>>>> - error_report_err(err);
>>>> }
>>>>
>>>> Please correct me if I missed anything.
>>>
>>> Resulting function:
>>>
>>> static void vfio_err_notifier_handler(void *opaque)
>>> {
>>> VFIOPCIDevice *vdev = opaque;
>>> Error *err = NULL;
>>>
>>> if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>> return;
>>> }
>>>
>>> if (vdev->err_handler) {
>>> if (!vdev->err_handler(vdev, &err)) {
>>> error_report_err(err);
>>> } else {
>>> return;
>>> }
>>> }
>>>
>>> /*
>>> * TBD. Retrieve the error details and decide what action
>>> * needs to be taken. One of the actions could be to pass
>>> * the error to the guest and have the guest driver recover
>>> * from the error. This requires that PCIe capabilities be
>>> * exposed to the guest. For now, we just terminate the
>>> * guest to contain the error.
>>> */
>>>
>>> error_report("%s(%s) Unrecoverable error detected. Please collect any data possible and then kill the guest", __func__, vdev->vbasedev.name);
>>>
>>> vm_stop(RUN_STATE_INTERNAL_ERROR);
>>> }
>>>
>>> Slighly rearranged for clearer control flow:
>>>
>>> static void vfio_err_notifier_handler(void *opaque)
>>> {
>>> VFIOPCIDevice *vdev = opaque;
>>> Error *err = NULL;
>>>
>>> if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>> return;
>>> }
>>>
>>> if (vdev->err_handler) {
>>> if (vdev->err_handler(vdev, &err)) {
>>> /* Error successfully handled */
>>> return;
>>> }
>>> error_report_err(err);
>>> }
>
> Yes, this is what i intended to do with my patch and provide a clearer flow. Though the compiler error reported by Cedric, is a little confusing, need to understand why that happens.
>
>
>>>
>>> /*
>>> * TBD. Retrieve the error details and decide what action
>>> * needs to be taken. One of the actions could be to pass
>>> * the error to the guest and have the guest driver recover
>>> * from the error. This requires that PCIe capabilities be
>>> * exposed to the guest. For now, we just terminate the
>>> * guest to contain the error.
>>> */
>>>
>>> error_report("%s(%s) Unrecoverable error detected. Please collect any data possible and then kill the guest", __func__, vdev->vbasedev.name);
>>>
>>> vm_stop(RUN_STATE_INTERNAL_ERROR);
>>> }
>>>
>>> Questions / issues:
>>>
>>> * Is the comment still accurate?
>
> This comment would still apply for vfio-pci devices on other architectures except for s390x. We are trying to change this behavior for s390x.
The comment is about things that should be done to handle the error.
Would these things be done here, or in a suitable ->err_handler()?
>>>
>>> * When ->err_handler() fails, we report the error twice. Would it make
>>> sense to combine the two error reports into one?
>>
>> Yes. It was my request too.
>>
>> Thanks,
>>
>> C.
>
> I was a little hesitant about changing the existing error message as its been there for almost 12 years (since commit 7b4b0e9eda ("vfio: QEMU-AER: Qemu changes to support AER for VFIO-PCI devices")). Nothing should ever dependent on specific error messages, but still.. .If the preference is to combine/change the message I can do that.
Don't hesitate to improve error messages.
>>> * Preexisting: the second error message is ugly.
>>>
>>> Error messages should be short and to the point: single phrase, with
>>> no newline or trailing punctuation. The "please collect ..." part
>>> does not belong to the error message proper, it's advice on what to
>>> do. Better: report the error, then print advice:
>>>
>>> error_report("%s(%s) Unrecoverable error detected",
>>> __func__, vdev->vbasedev.name);
>>> error_printf("Please collect any data possible and then kill the guest.");
>>>
>>> Including __func__ in an error message is an anti-pattern. Look at
>>>
>>> vfio_err_notifier_handler(fred) Unrecoverable error detected
>>>
>>> with a user's eyes: "vfio_err_notifier_handler" is programmer
>>> gobbledygook, the device name "fred" is useful once you realize what
>>> it is, "Unrecoverable error detected" lacks detail.
>>>
>>> [...]
>>>
>>
> How about "(device) Unrecoverable PCIe error detected for device"
Suggest "for device %s", where %s identifies the device to the user.
next prev parent reply other threads:[~2025-09-30 9:21 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-25 17:48 [PATCH v3 0/5] Error recovery for zPCI passthrough devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 1/5] [NOTFORMERGE] linux-headers: Update for zpci vfio device Farhan Ali
2025-09-25 17:48 ` [PATCH v3 2/5] vfio/pci: Add an error handler callback Farhan Ali
2025-09-26 4:57 ` Markus Armbruster
2025-09-26 7:40 ` Cédric Le Goater
2025-09-26 18:44 ` Farhan Ali
2025-09-26 17:53 ` Farhan Ali
2025-09-27 5:59 ` Markus Armbruster
2025-09-27 7:05 ` Cédric Le Goater
2025-09-29 17:20 ` Farhan Ali
2025-09-30 9:20 ` Markus Armbruster [this message]
2025-09-30 17:15 ` Farhan Ali
2025-10-01 4:52 ` Markus Armbruster
2025-10-01 18:21 ` Farhan Ali
2025-10-06 6:06 ` Markus Armbruster
2025-09-25 17:48 ` [PATCH v3 3/5] vfio: Introduce vfio_device_feature helper function Farhan Ali
2025-09-25 17:48 ` [PATCH v3 4/5] s390x/pci: Add PCI error handling for vfio pci devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 5/5] s390x/pci: Reset a device in error state Farhan Ali
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87qzvo9tzk.fsf@pond.sub.org \
--to=armbru@redhat.com \
--cc=alex.williamson@redhat.com \
--cc=alifm@linux.ibm.com \
--cc=clg@redhat.com \
--cc=mjrosato@linux.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-s390x@nongnu.org \
--cc=thuth@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.