From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
To: "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
"Welty, Brian" <brian.welty@intel.com>,
"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC HW errors.
Date: Thu, 19 Oct 2023 11:32:00 +0530 [thread overview]
Message-ID: <b5456a04-312e-5ea9-d0d4-fa628beeb557@linux.intel.com> (raw)
In-Reply-To: <MW4PR11MB705661DB0BF942C87D8CBA75B3D4A@MW4PR11MB7056.namprd11.prod.outlook.com>
On 19/10/23 11:06, Ghimiray, Himal Prasad wrote:
> Hi Brian,
>
>> -----Original Message-----
>> From: Welty, Brian <brian.welty@intel.com>
>> Sent: 19 October 2023 06:22
>> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-
>> xe@lists.freedesktop.org
>> Subject: Re: [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC
>> HW errors.
>>
>>
>>
>> On 10/17/2023 9:00 PM, Himal Prasad Ghimiray wrote:
>>> Send uevent incase of nonfatal errors reported by gsc.
>>>
>>> v2
>>> - No need to provide tile info in uevent because error is accepted
>>> only in tile0. (Aravind)
>> This is user API. Don't we need to demonstrate user-space consumer of this
>> before merge? Do we at least have IGT? Also see question below.
> https://one-api.gitlab-pages.devtools.intel.com/level_zero/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED is the consumer for this uevent.
> The hardware error injection via IGT is not possible. It explicitly needs python SV and ITP.
>>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_device_types.h | 3 +++
>>> drivers/gpu/drm/xe/xe_hw_error.c | 16 ++++++++++++++++
>>> drivers/gpu/drm/xe/xe_hw_error.h | 3 ++-
>>> drivers/gpu/drm/xe/xe_irq.c | 4 ++++
>>> include/uapi/drm/xe_drm.h | 8 ++++++++
>>> 5 files changed, 33 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
>>> b/drivers/gpu/drm/xe/xe_device_types.h
>>> index 2998ee517f0d..d2ee5549d20c 100644
>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>> @@ -196,6 +196,9 @@ struct xe_tile {
>>> struct tile_hw_errors {
>>> struct xarray hw_error;
>>> } errors;
>>> +
>>> + /** @gsc_hw_err_work: worker for uevent to report GSC HW errors
>> */
>>> + struct work_struct gsc_hw_err_work;
>>> };
>>>
>>> /**
>>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c
>>> b/drivers/gpu/drm/xe/xe_hw_error.c
>>> index 1e94ee72a34f..9ac817c1dd03 100644
>>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>> @@ -3,6 +3,8 @@
>>> * Copyright © 2023 Intel Corporation
>>> */
>>>
>>> +#include <drm/xe_drm.h>
>>> +
>>> #include "xe_hw_error.h"
>>>
>>> #include "regs/xe_regs.h"
>>> @@ -387,6 +389,19 @@ xe_gt_hw_error_handler(struct xe_gt *gt, const
>> enum hardware_error hw_err)
>>> xe_gt_hw_error_log_vector_reg(gt, hw_err);
>>> }
>>>
>>> +void xe_gsc_hw_error_work(struct work_struct *work) {
>>> + struct xe_tile *tile = container_of(work, typeof(*tile),
>> gsc_hw_err_work);
>>> + char *csc_hw_error_event[3];
>>> +
>>> + csc_hw_error_event[0] = XE_GSC_HW_HEALTH_UEVENT "=1";
>>> + csc_hw_error_event[1] = "RESET_REQUIRED=1";
>>> + csc_hw_error_event[2] = NULL;
>>> +
>>> + kobject_uevent_env(&tile->xe->drm.primary->kdev->kobj,
>> KOBJ_CHANGE,
>>> + csc_hw_error_event);
>>> +}
>>> +
>> ...
>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>> index d48d8e3c898c..06f6fce1531d 100644
>>> --- a/include/uapi/drm/xe_drm.h
>>> +++ b/include/uapi/drm/xe_drm.h
>>> @@ -16,6 +16,14 @@ extern "C" {
>>> * subject to backwards-compatibility constraints.
>>> */
>>>
>>> +/**
>>> + * DOC: uevent generated by xe on it's tile node.
>> What does 'tile node' mean? If I'm not mistaken, isn't
>> tile->xe->drm.primary->kdev->kobj just the device node?
> Miss from my end. It should be device node.
>>> + *
>>> + * XE_GSC_HW_HEALTH_UEVENT - Event is generated when GSC reports
>> HW
>>> + * errors. The value supplied with the event is always
>> "RESET_REQUIRED=1".
>>> + */
>>> +#define XE_GSC_HW_HEALTH_UEVENT "DEVICE_STATUS"
>> So both XE_RESET_FAILED_UEVENT and XE_GSC_HW_HEALTH_UEVENT are
>> defined as "DEVICE_STATUS". Should be unique or isn't this going to be
>> confusing to the user reading these events?
> The values provided are different and uevents are on different nodes.
> XE_RESET_FAILED_UEVENT is at pci node of xe device whereas XE_GSC_HW_HEALTH_UEVENT
> Is at device node.
the reset that you expect here to recover the device is a PCIe level reset same as in
XE_RESET_FAILED_UEVENTso why should this be with drm device node.
also the uevent doesn't seem to convey the cause.
with XE_GSC_HW_HEALTH_UEVENT userspace will see:
DEVICE_STATUS=1
RESET_REQUIRED=1
TILE_ID=0
while for XE_RESET_FAILED_UEVENT:
DEVICE_STATUS=NEEDS_RESET
RESET_REQUIRED=1
TILE_ID=0
GT_ID=0
I would rather say we define one UEVENT UAPI
XE_RESET_REQUIRED "RESET_REQUIRED"
add event specific details like:
CSC_HW_ERROR = 1
or
GT_RESET_FAILED = 1
or like RESET_REASON="CSC_HW_ERROR/GT_RESET_FAILED"
and add additional details like tile and GT.
Thanks,
Aravind.
>
> BR
> Himal
>> -Brian
>>
>>
>>> +
>>> /**
>>> * DOC: uevent generated by xe on it's pci node.
>>> *
next prev parent reply other threads:[~2023-10-19 5:59 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-18 4:00 [Intel-xe] [PATCH v9 00/10] Supporting RAS on XE Himal Prasad Ghimiray
2023-10-18 3:57 ` [Intel-xe] ✓ CI.Patch_applied: success for " Patchwork
2023-10-18 3:57 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork
2023-10-18 3:59 ` [Intel-xe] ✓ CI.KUnit: success " Patchwork
2023-10-18 4:00 ` [Intel-xe] [PATCH v8 01/10] drm/xe: Handle errors from various components Himal Prasad Ghimiray
2023-10-19 8:23 ` Aravind Iddamsetty
2023-10-19 13:23 ` Upadhyay, Tejas
2023-10-18 4:00 ` [Intel-xe] [PATCH v7 02/10] drm/xe: Log and count the GT hardware errors Himal Prasad Ghimiray
2023-10-19 8:24 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v6 03/10] drm/xe: Support GT hardware error reporting for PVC Himal Prasad Ghimiray
2023-10-19 8:25 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v2 04/10] drm/xe: Support GSC " Himal Prasad Ghimiray
2023-10-19 8:25 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC HW errors Himal Prasad Ghimiray
2023-10-19 0:52 ` Welty, Brian
2023-10-19 5:36 ` Ghimiray, Himal Prasad
2023-10-19 6:02 ` Aravind Iddamsetty [this message]
2023-10-19 6:36 ` Ghimiray, Himal Prasad
2023-10-18 4:00 ` [Intel-xe] [PATCH v3 06/10] drm/xe: Support SOC FATAL error handling for PVC Himal Prasad Ghimiray
2023-10-19 8:25 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v2 07/10] drm/xe: Support SOC NONFATAL " Himal Prasad Ghimiray
2023-10-19 8:26 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v2 08/10] drm/xe: Handle MDFI error severity Himal Prasad Ghimiray
2023-10-19 8:26 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v2 09/10] drm/xe: Clear SOC CORRECTABLE error registers Himal Prasad Ghimiray
2023-10-19 8:26 ` Aravind Iddamsetty
2023-10-18 4:00 ` [Intel-xe] [PATCH v4 10/10] drm/xe: Clear all SoC errors post warm reset Himal Prasad Ghimiray
2023-10-19 8:26 ` Aravind Iddamsetty
2023-10-18 4:07 ` [Intel-xe] ✓ CI.Build: success for Supporting RAS on XE Patchwork
2023-10-18 4:08 ` [Intel-xe] ✓ CI.Hooks: " Patchwork
2023-10-18 4:09 ` [Intel-xe] ✓ CI.checksparse: " Patchwork
2023-10-18 4:45 ` [Intel-xe] ✓ CI.BAT: " Patchwork
-- strict thread matches above, loose matches on Subject: below --
2023-10-18 2:57 [Intel-xe] [PATCH v8 00/10] " Himal Prasad Ghimiray
2023-10-18 2:57 ` [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC HW errors Himal Prasad Ghimiray
2023-10-18 2:48 [Intel-xe] [PATCH v8 00/10] *Supporting RAS on XE Himal Prasad Ghimiray
2023-10-18 2:48 ` [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC HW errors Himal Prasad Ghimiray
2023-10-17 5:09 [Intel-xe] [PATCH v6 00/10] Supporting RAS on XE Himal Prasad Ghimiray
2023-10-17 5:09 ` [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC HW errors Himal Prasad Ghimiray
2023-10-17 4:15 [Intel-xe] [PATCH v6 00/10] Supporting RAS on XE Himal Prasad Ghimiray
2023-10-17 4:15 ` [Intel-xe] [PATCH v2 05/10] drm/xe: Notify userspace about GSC HW errors Himal Prasad Ghimiray
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b5456a04-312e-5ea9-d0d4-fa628beeb557@linux.intel.com \
--to=aravind.iddamsetty@linux.intel.com \
--cc=brian.welty@intel.com \
--cc=himal.prasad.ghimiray@intel.com \
--cc=intel-xe@lists.freedesktop.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.