qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Farhan Ali <alifm@linux.ibm.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: "Cédric Le Goater" <clg@redhat.com>,
	qemu-devel@nongnu.org, qemu-s390x@nongnu.org,
	mjrosato@linux.ibm.com, thuth@redhat.com,
	alex.williamson@redhat.com
Subject: Re: [PATCH v3 2/5] vfio/pci: Add an error handler callback
Date: Tue, 30 Sep 2025 10:15:28 -0700	[thread overview]
Message-ID: <5ebeb8ec-395b-46ca-b6d4-b2c78ae0465f@linux.ibm.com> (raw)
In-Reply-To: <87qzvo9tzk.fsf@pond.sub.org>


On 9/30/2025 2:20 AM, Markus Armbruster wrote:
> Farhan Ali <alifm@linux.ibm.com> writes:
>
>> On 9/27/2025 12:05 AM, Cédric Le Goater wrote:
>>> On 9/27/25 07:59, Markus Armbruster wrote:
>>>> Farhan Ali <alifm@linux.ibm.com> writes:
>>>>
>>>>> On 9/25/2025 9:57 PM, Markus Armbruster wrote:
>>>>>> Farhan Ali <alifm@linux.ibm.com> writes:
>>>>>>
>>>>>>> Provide a vfio error handling callback, that can be used by devices to
>>>>>>> handle PCI errors for passthrough devices.
>>>>>>>
>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>> ---
>>>>>>>     hw/vfio/pci.c | 8 ++++++++
>>>>>>>     hw/vfio/pci.h | 1 +
>>>>>>>     2 files changed, 9 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index bc0b4c4d56..b02a974954 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -3063,11 +3063,19 @@ void vfio_pci_put_device(VFIOPCIDevice *vdev)
>>>>>>>    static void vfio_err_notifier_handler(void *opaque)
>>>>>>>    {
>>>>>>>        VFIOPCIDevice *vdev = opaque;
>>>>>>> +    Error *err = NULL;
>>>>>>>
>>>>>>>        if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>>>>>>            return;
>>>>>>>        }
>>>>>>>
>>>>>>> +    if (vdev->err_handler) {
>>>>>>> +        if (vdev->err_handler(vdev, &err)) {
>>>>>>> +            return;
>>>>>>> +        }
>>>>>>> +        error_report_err(err);
>>>>>>> +    }
>>>>>> This is unusual.
>>>>>>
>>>>>> Functions taking an Error ** argument usually do so to report errors.
>>>>>> The rules spelled out in qapi/error.h apply.  In particular:
>>>>>>
>>>>>>     * - On success, the function should not touch *errp.  On failure, it
>>>>>>     *   should set a new error, e.g. with error_setg(errp, ...), or
>>>>>>     *   propagate an existing one, e.g. with error_propagate(errp, ...).
>>>>>>     *
>>>>>>     * - Whenever practical, also return a value that indicates success /
>>>>>>     *   failure.  This can make the error checking more concise, and can
>>>>>>     *   avoid useless error object creation and destruction. Note that
>>>>>>     *   we still have many functions returning void.  We recommend
>>>>>>     *   • bool-valued functions return true on success / false on failure,
>>>>>>
>>>>>> If ->err_handler() behaved that way, it @err would be null after it
>>>>>> returns false.  We'd call error_report_err(NULL), and crash.
>>>>>>
>>>>>> Functions with unusual behavior need a contract: a comment spelling out
>>>>>> their behavior.
>>>>>>
>>>>>> What is the intended behavior of the err_handler() callback?
>>>>> Hi Markus,
>>>>>
>>>>> Thanks for reviewing! The intended behavior for err_handler() is to set errp and report the error on false/failure. With the above code, I also intended fall through to vm_stop() when err_handler() fails.
>>>>>
>>>>> I think I misunderstood the errp error handling, it seems like the correct way to do what I intended would be
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index b02a974954..630de46c90 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -3070,10 +3070,11 @@ static void vfio_err_notifier_handler(void *opaque)
>>>>>        }
>>>>>
>>>>>        if (vdev->err_handler) {
>>>>> -        if (vdev->err_handler(vdev, &err)) {
>>>>> +        if (!vdev->err_handler(vdev, &err)) {
>>>>> +            error_report_err(err);
>>>>> +        } else {
>>>>>                return;
>>>>>            }
>>>>> -        error_report_err(err);
>>>>>        }
>>>>>
>>>>> Please correct me if I missed anything.
>>>> Resulting function:
>>>>
>>>>      static void vfio_err_notifier_handler(void *opaque)
>>>>      {
>>>>          VFIOPCIDevice *vdev = opaque;
>>>>          Error *err = NULL;
>>>>
>>>>          if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>>>              return;
>>>>          }
>>>>
>>>>          if (vdev->err_handler) {
>>>>              if (!vdev->err_handler(vdev, &err)) {
>>>>                  error_report_err(err);
>>>>              } else {
>>>>                  return;
>>>>              }
>>>>          }
>>>>
>>>>          /*
>>>>           * TBD. Retrieve the error details and decide what action
>>>>           * needs to be taken. One of the actions could be to pass
>>>>           * the error to the guest and have the guest driver recover
>>>>           * from the error. This requires that PCIe capabilities be
>>>>           * exposed to the guest. For now, we just terminate the
>>>>           * guest to contain the error.
>>>>           */
>>>>
>>>>          error_report("%s(%s) Unrecoverable error detected. Please collect any data possible and then kill the guest", __func__, vdev->vbasedev.name);
>>>>
>>>>          vm_stop(RUN_STATE_INTERNAL_ERROR);
>>>>      }
>>>>
>>>> Slighly rearranged for clearer control flow:
>>>>
>>>>      static void vfio_err_notifier_handler(void *opaque)
>>>>      {
>>>>          VFIOPCIDevice *vdev = opaque;
>>>>          Error *err = NULL;
>>>>
>>>>          if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>>>>              return;
>>>>          }
>>>>
>>>>          if (vdev->err_handler) {
>>>>              if (vdev->err_handler(vdev, &err)) {
>>>>                  /* Error successfully handled */
>>>>                  return;
>>>>              }
>>>>              error_report_err(err);
>>>>          }
>> Yes, this is what i intended to do with my patch and provide a clearer flow. Though the compiler error reported by Cedric, is a little confusing, need to understand why that happens.
>>
>>
>>>>          /*
>>>>           * TBD. Retrieve the error details and decide what action
>>>>           * needs to be taken. One of the actions could be to pass
>>>>           * the error to the guest and have the guest driver recover
>>>>           * from the error. This requires that PCIe capabilities be
>>>>           * exposed to the guest. For now, we just terminate the
>>>>           * guest to contain the error.
>>>>           */
>>>>
>>>>          error_report("%s(%s) Unrecoverable error detected. Please collect any data possible and then kill the guest", __func__, vdev->vbasedev.name);
>>>>
>>>>          vm_stop(RUN_STATE_INTERNAL_ERROR);
>>>>      }
>>>>
>>>> Questions / issues:
>>>>
>>>> * Is the comment still accurate?
>> This comment would still apply for vfio-pci devices on other architectures except for s390x. We are trying to change this behavior for s390x.
> The comment is about things that should be done to handle the error.
> Would these things be done here, or in a suitable ->err_handler()?

Ideally in the err_handler(). And for s390x we try do what the comment 
mentions, which is inject the error into the guest through s390x 
architecture specific mechanism. I can remove the comment block.


>
>>>> * When ->err_handler() fails, we report the error twice. Would it make
>>>>     sense to combine the two error reports into one?
>>> Yes. It was my request too.
>>>
>>> Thanks,
>>>
>>> C.
>> I was a little hesitant about changing the existing error message as its been there for almost 12 years (since commit 7b4b0e9eda ("vfio: QEMU-AER: Qemu changes to support AER for VFIO-PCI devices")). Nothing should ever dependent on specific error messages, but still.. .If the preference is to combine/change the message I can do that.
> Don't hesitate to improve error messages.
>
>>>> * Preexisting: the second error message is ugly.
>>>>
>>>>     Error messages should be short and to the point: single phrase, with
>>>>     no newline or trailing punctuation.  The "please collect ..." part
>>>>     does not belong to the error message proper, it's advice on what to
>>>>     do.  Better: report the error, then print advice:
>>>>
>>>>          error_report("%s(%s) Unrecoverable error detected",
>>>>                       __func__, vdev->vbasedev.name);
>>>>          error_printf("Please collect any data possible and then kill the guest.");
>>>>
>>>>     Including __func__ in an error message is an anti-pattern. Look at
>>>>
>>>>       vfio_err_notifier_handler(fred) Unrecoverable error detected
>>>>
>>>>     with a user's eyes: "vfio_err_notifier_handler" is programmer
>>>>     gobbledygook, the device name "fred" is useful once you realize what
>>>>     it is, "Unrecoverable error detected" lacks detail.
>>>>
>>>> [...]
>>>>
>> How about "(device) Unrecoverable PCIe error detected for device"
> Suggest "for device %s", where %s identifies the device to the user.

Sure, I can make the change.

Thanks

Farhan

>


  reply	other threads:[~2025-09-30 17:17 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-25 17:48 [PATCH v3 0/5] Error recovery for zPCI passthrough devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 1/5] [NOTFORMERGE] linux-headers: Update for zpci vfio device Farhan Ali
2025-09-25 17:48 ` [PATCH v3 2/5] vfio/pci: Add an error handler callback Farhan Ali
2025-09-26  4:57   ` Markus Armbruster
2025-09-26  7:40     ` Cédric Le Goater
2025-09-26 18:44       ` Farhan Ali
2025-09-26 17:53     ` Farhan Ali
2025-09-27  5:59       ` Markus Armbruster
2025-09-27  7:05         ` Cédric Le Goater
2025-09-29 17:20           ` Farhan Ali
2025-09-30  9:20             ` Markus Armbruster
2025-09-30 17:15               ` Farhan Ali [this message]
2025-10-01  4:52                 ` Markus Armbruster
2025-10-01 18:21                   ` Farhan Ali
2025-10-06  6:06                     ` Markus Armbruster
2025-09-25 17:48 ` [PATCH v3 3/5] vfio: Introduce vfio_device_feature helper function Farhan Ali
2025-09-25 17:48 ` [PATCH v3 4/5] s390x/pci: Add PCI error handling for vfio pci devices Farhan Ali
2025-09-25 17:48 ` [PATCH v3 5/5] s390x/pci: Reset a device in error state Farhan Ali

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5ebeb8ec-395b-46ca-b6d4-b2c78ae0465f@linux.ibm.com \
    --to=alifm@linux.ibm.com \
    --cc=alex.williamson@redhat.com \
    --cc=armbru@redhat.com \
    --cc=clg@redhat.com \
    --cc=mjrosato@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-s390x@nongnu.org \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).