* [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure
@ 2023-07-18 13:32 Himal Prasad Ghimiray
2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Himal Prasad Ghimiray @ 2023-07-18 13:32 UTC (permalink / raw)
To: intel-xe; +Cc: Lucas De Marchi, Rodrigo Vivi
series sends an uevent in case of gt reset failure.
since we have no mechanism to ensure gt reset failure from
hw perspective introducing the fault injection to validate the uevent.
https://gitlab.freedesktop.org/drm/xe/ci/-/commit/5ab4b369758a57b5a799b93c1d5b556c98d9d557
is enabling the config to ensure debugfs creations are succesful.
Tested-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Testcase: https://patchwork.freedesktop.org/series/120913/
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Himal Prasad Ghimiray (2):
drm/xe: Notify Userspace when gt reset fails
drm/xe: Introduce fault injection for gt reset
drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++
drivers/gpu/drm/xe/xe_gt.c | 26 +++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_gt.h | 14 ++++++++++++++
include/uapi/drm/xe_drm.h | 8 ++++++++
4 files changed, 57 insertions(+), 1 deletion(-)
--
2.25.1
^ permalink raw reply [flat|nested] 13+ messages in thread* [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails 2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray @ 2023-07-18 13:32 ` Himal Prasad Ghimiray 2023-07-18 15:01 ` Nilawar, Badal 2023-07-18 23:52 ` Roper, Matthew D 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset Himal Prasad Ghimiray ` (3 subsequent siblings) 4 siblings, 2 replies; 13+ messages in thread From: Himal Prasad Ghimiray @ 2023-07-18 13:32 UTC (permalink / raw) To: intel-xe; +Cc: Rodrigo Vivi Send uevent in case of gt reset failure. This intimation can be used by userspace monitoring tool to do the device level reset/reboot when GT reset fails. udevadm can be used to monitor the uevents. v2: -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange variables in Christmas tree order(Tejas) -Check GUC_GSC_OTHER_CLASS(Tejas) v3: - Rebase - Remove notification for engine reset failure (Rodrigo) v4 - Rectify the comments in header file. Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> --- drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++ include/uapi/drm/xe_drm.h | 8 ++++++++ 2 files changed, 26 insertions(+) diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c index a21d44bfe9e8..1db4d610f2fd 100644 --- a/drivers/gpu/drm/xe/xe_gt.c +++ b/drivers/gpu/drm/xe/xe_gt.c @@ -8,6 +8,7 @@ #include <linux/minmax.h> #include <drm/drm_managed.h> +#include <drm/xe_drm.h> #include "regs/xe_gt_regs.h" #include "xe_bb.h" @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt) return 0; } +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) +{ + char *reset_event[5]; + + reset_event[0] = XE_RESET_FAILED_UEVENT "=1"; + reset_event[1] = "RESET_ENABLED=1"; + reset_event[2] = "RESET_UNIT=gt"; + reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id); + reset_event[4] = NULL; + kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE, reset_event); + + kfree(reset_event[3]); +} + static int gt_reset(struct xe_gt *gt) { int err; @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt) xe_device_mem_access_put(gt_to_xe(gt)); xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); + /* Notify userspace about gt reset failure */ + xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id); + return err; } diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index 347351a8f618..c28bb54812e5 100644 --- a/include/uapi/drm/xe_drm.h +++ b/include/uapi/drm/xe_drm.h @@ -16,6 +16,14 @@ extern "C" { * subject to backwards-compatibility constraints. */ +/* + * Uevents generated by xe on it's device node. + * + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset + * gt fails. The value supplied with the event is always 1. + */ +#define XE_RESET_FAILED_UEVENT "RESET_FAILED" + /** * struct xe_user_extension - Base class for defining a chain of extensions * -- 2.25.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray @ 2023-07-18 15:01 ` Nilawar, Badal 2023-07-18 23:52 ` Roper, Matthew D 1 sibling, 0 replies; 13+ messages in thread From: Nilawar, Badal @ 2023-07-18 15:01 UTC (permalink / raw) To: Himal Prasad Ghimiray, intel-xe; +Cc: Rodrigo Vivi On 18-07-2023 19:02, Himal Prasad Ghimiray wrote: > Send uevent in case of gt reset failure. This intimation can be used by > userspace monitoring tool to do the device level reset/reboot > when GT reset fails. udevadm can be used to monitor the uevents. > > v2: > -Add NULL check for xe_gt_hw_engine return(Aravind) > -Arrange variables in Christmas tree order(Tejas) > -Check GUC_GSC_OTHER_CLASS(Tejas) Commit message 1 and 3 are not applicable any more. > > v3: > - Rebase > - Remove notification for engine reset failure (Rodrigo) > > v4 > - Rectify the comments in header file. > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Remove duplicate Signed-off-by: > --- > drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++ > include/uapi/drm/xe_drm.h | 8 ++++++++ > 2 files changed, 26 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c > index a21d44bfe9e8..1db4d610f2fd 100644 > --- a/drivers/gpu/drm/xe/xe_gt.c > +++ b/drivers/gpu/drm/xe/xe_gt.c > @@ -8,6 +8,7 @@ > #include <linux/minmax.h> > > #include <drm/drm_managed.h> > +#include <drm/xe_drm.h> > > #include "regs/xe_gt_regs.h" > #include "xe_bb.h" > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt) > return 0; > } > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) > +{ > + char *reset_event[5]; > + > + reset_event[0] = XE_RESET_FAILED_UEVENT "=1"; > + reset_event[1] = "RESET_ENABLED=1"; > + reset_event[2] = "RESET_UNIT=gt"; > + reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id); > + reset_event[4] = NULL; May be add empty line here. > + kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE, reset_event); > + > + kfree(reset_event[3]); > +} > + > static int gt_reset(struct xe_gt *gt) > { > int err; > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt) > xe_device_mem_access_put(gt_to_xe(gt)); > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > + /* Notify userspace about gt reset failure */ > + xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id); > + > return err; > } > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h > index 347351a8f618..c28bb54812e5 100644 > --- a/include/uapi/drm/xe_drm.h > +++ b/include/uapi/drm/xe_drm.h > @@ -16,6 +16,14 @@ extern "C" { > * subject to backwards-compatibility constraints. > */ > > +/* > + * Uevents generated by xe on it's device node. s/Uevents/Uevent/ ? > + * > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset > + * gt fails. The value supplied with the event is always 1. > + */ > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED" > + Regards, Badal > /** > * struct xe_user_extension - Base class for defining a chain of extensions > * ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray 2023-07-18 15:01 ` Nilawar, Badal @ 2023-07-18 23:52 ` Roper, Matthew D 2023-07-19 1:51 ` Ghimiray, Himal Prasad 1 sibling, 1 reply; 13+ messages in thread From: Roper, Matthew D @ 2023-07-18 23:52 UTC (permalink / raw) To: Ghimiray, Himal Prasad; +Cc: intel-xe@lists.freedesktop.org, Vivi, Rodrigo On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote: > Send uevent in case of gt reset failure. This intimation can be used by > userspace monitoring tool to do the device level reset/reboot > when GT reset fails. udevadm can be used to monitor the uevents. This seems a bit questionable. In theory a GT reset shouldn't fail; if it does it either means we've either got a major software bug or a really catastrophic hardware failure. Letting the kernel driver just blindly move forward after reporting a uevent doesn't seem like a proper response to a huge problem like that. Relying on userspace to pick up the pieces seems insufficient. Right now the return code from gt_reset() is just ignored and no meaningful action is taken within the driver. It seems like we should at least be putting the device into some kind of 'unusable' state (similar to i915's 'wedged') so that other parts of the driver know that everything is completely broken and that they should stop trying to use the device (or some subset of the device). We could also potentially attempt recovery ourselves by escalating to a DriverFLR. I'm not sure why any userspace would ever care about the specific condition of "gt reset failed" --- it seems like it would pretty much only care about the device being in one of three states and wouldn't care about the driver-internal details about how/why it's in one of those: * Everything is good; can use device as normal * Device partially broken; it's no longer possible to use these specific GTs, engines, etc., but other units should still function as expected. * Absolutely everything is broken and nothing works. You _might_ be able to unload the driver, issue a PCI FLR, and reload the driver, but otherwise all hope is lost and the device cannot be used anymore. Even if we decide that some uevent is appropriate here, this is new uapi and will need a real userspace consumer too. The fact that we can watch any uevents raised by the device with udevadm doesn't seem like it would be sufficient for that purpose; that doesn't feel much different than claiming 'cat' as the userspace consumer of a sysfs node or something. What we really need for new uapi is a meaningful top-to-bottom solution that gives us confidence that we have an appropriate and maintainable interface defined. Matt > > v2: > -Add NULL check for xe_gt_hw_engine return(Aravind) > -Arrange variables in Christmas tree order(Tejas) > -Check GUC_GSC_OTHER_CLASS(Tejas) > > v3: > - Rebase > - Remove notification for engine reset failure (Rodrigo) > > v4 > - Rectify the comments in header file. > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> > --- > drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++ > include/uapi/drm/xe_drm.h | 8 ++++++++ > 2 files changed, 26 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c > index a21d44bfe9e8..1db4d610f2fd 100644 > --- a/drivers/gpu/drm/xe/xe_gt.c > +++ b/drivers/gpu/drm/xe/xe_gt.c > @@ -8,6 +8,7 @@ > #include <linux/minmax.h> > > #include <drm/drm_managed.h> > +#include <drm/xe_drm.h> > > #include "regs/xe_gt_regs.h" > #include "xe_bb.h" > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt) > return 0; > } > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) > +{ > + char *reset_event[5]; > + > + reset_event[0] = XE_RESET_FAILED_UEVENT "=1"; > + reset_event[1] = "RESET_ENABLED=1"; > + reset_event[2] = "RESET_UNIT=gt"; > + reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id); > + reset_event[4] = NULL; > + kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE, reset_event); > + > + kfree(reset_event[3]); > +} > + > static int gt_reset(struct xe_gt *gt) > { > int err; > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt) > xe_device_mem_access_put(gt_to_xe(gt)); > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > + /* Notify userspace about gt reset failure */ > + xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id); > + > return err; > } > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h > index 347351a8f618..c28bb54812e5 100644 > --- a/include/uapi/drm/xe_drm.h > +++ b/include/uapi/drm/xe_drm.h > @@ -16,6 +16,14 @@ extern "C" { > * subject to backwards-compatibility constraints. > */ > > +/* > + * Uevents generated by xe on it's device node. > + * > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset > + * gt fails. The value supplied with the event is always 1. > + */ > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED" > + > /** > * struct xe_user_extension - Base class for defining a chain of extensions > * > -- > 2.25.1 > -- Matt Roper Graphics Software Engineer Linux GPU Platform Enablement Intel Corporation ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails 2023-07-18 23:52 ` Roper, Matthew D @ 2023-07-19 1:51 ` Ghimiray, Himal Prasad 2023-07-20 16:32 ` Rodrigo Vivi 0 siblings, 1 reply; 13+ messages in thread From: Ghimiray, Himal Prasad @ 2023-07-19 1:51 UTC (permalink / raw) To: Roper, Matthew D; +Cc: intel-xe@lists.freedesktop.org, Vivi, Rodrigo Hi Matt, > -----Original Message----- > From: Roper, Matthew D <matthew.d.roper@intel.com> > Sent: 19 July 2023 05:23 > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com> > Cc: intel-xe@lists.freedesktop.org; Vivi, Rodrigo <rodrigo.vivi@intel.com> > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset > fails > > On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote: > > Send uevent in case of gt reset failure. This intimation can be used > > by userspace monitoring tool to do the device level reset/reboot when > > GT reset fails. udevadm can be used to monitor the uevents. > > This seems a bit questionable. In theory a GT reset shouldn't fail; if it does it > either means we've either got a major software bug or a really catastrophic > hardware failure. Letting the kernel driver just blindly move forward after > reporting a uevent doesn't seem like a proper response to a huge problem > like that. Relying on userspace to pick up the pieces seems insufficient. > > Right now the return code from gt_reset() is just ignored and no meaningful > action is taken within the driver. It seems like we should at least be putting > the device into some kind of 'unusable' state (similar to i915's 'wedged') so > that other parts of the driver know that everything is completely broken and > that they should stop trying to use the device (or some subset of the device). > We could also potentially attempt recovery ourselves by escalating to a > DriverFLR. Agreed. These feature needs to be there in the driver. > > I'm not sure why any userspace would ever care about the specific condition > of "gt reset failed" --- it seems like it would pretty much only care about the > device being in one of three states and wouldn't care about the driver- > internal details about how/why it's in one of > those: > > * Everything is good; can use device as normal > * Device partially broken; it's no longer possible to use these > specific GTs, engines, etc., but other units should still function as > expected. > * Absolutely everything is broken and nothing works. You _might_ be > able to unload the driver, issue a PCI FLR, and reload the driver, > but otherwise all hope is lost and the device cannot be used anymore. > > Even if we decide that some uevent is appropriate here, this is new uapi and > will need a real userspace consumer too. The fact that we can watch any > uevents raised by the device with udevadm doesn't seem like it would be > sufficient for that purpose; that doesn't feel much different than claiming > 'cat' as the userspace consumer of a sysfs node or something. > What we really need for new uapi is a meaningful top-to-bottom solution > that gives us confidence that we have an appropriate and maintainable > interface defined. https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED is the consumer for this uapi. The application can register for events ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start listening to this event. Then Sysman would send ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application, incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1" or if Sysman detects a repair have occurred. Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event application could query zesDeviceGetState to get the reason of RESET. And Sysman would tell the reason of reset as : ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context and if context creation fails due to EIO error, then device is determined to be wedged and reset reason is declared as WEDGED And sysman triggers a SBR to reset the device. BR Himal > > > Matt > > > > > v2: > > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange variables > > in Christmas tree order(Tejas) -Check GUC_GSC_OTHER_CLASS(Tejas) > > > > v3: > > - Rebase > > - Remove notification for engine reset failure (Rodrigo) > > > > v4 > > - Rectify the comments in header file. > > > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> > > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > > Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> > > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> > > --- > > drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++ > > include/uapi/drm/xe_drm.h | 8 ++++++++ > > 2 files changed, 26 insertions(+) > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c > > index a21d44bfe9e8..1db4d610f2fd 100644 > > --- a/drivers/gpu/drm/xe/xe_gt.c > > +++ b/drivers/gpu/drm/xe/xe_gt.c > > @@ -8,6 +8,7 @@ > > #include <linux/minmax.h> > > > > #include <drm/drm_managed.h> > > +#include <drm/xe_drm.h> > > > > #include "regs/xe_gt_regs.h" > > #include "xe_bb.h" > > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt) > > return 0; > > } > > > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) { > > + char *reset_event[5]; > > + > > + reset_event[0] = XE_RESET_FAILED_UEVENT "=1"; > > + reset_event[1] = "RESET_ENABLED=1"; > > + reset_event[2] = "RESET_UNIT=gt"; > > + reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id); > > + reset_event[4] = NULL; > > + kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE, > > +reset_event); > > + > > + kfree(reset_event[3]); > > +} > > + > > static int gt_reset(struct xe_gt *gt) { > > int err; > > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt) > > xe_device_mem_access_put(gt_to_xe(gt)); > > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > > > + /* Notify userspace about gt reset failure */ > > + xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id); > > + > > return err; > > } > > > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h > > index 347351a8f618..c28bb54812e5 100644 > > --- a/include/uapi/drm/xe_drm.h > > +++ b/include/uapi/drm/xe_drm.h > > @@ -16,6 +16,14 @@ extern "C" { > > * subject to backwards-compatibility constraints. > > */ > > > > +/* > > + * Uevents generated by xe on it's device node. > > + * > > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset > > + * gt fails. The value supplied with the event is always 1. > > + */ > > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED" > > + > > /** > > * struct xe_user_extension - Base class for defining a chain of extensions > > * > > -- > > 2.25.1 > > > > -- > Matt Roper > Graphics Software Engineer > Linux GPU Platform Enablement > Intel Corporation ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails 2023-07-19 1:51 ` Ghimiray, Himal Prasad @ 2023-07-20 16:32 ` Rodrigo Vivi 2023-07-24 8:35 ` Ghimiray, Himal Prasad 0 siblings, 1 reply; 13+ messages in thread From: Rodrigo Vivi @ 2023-07-20 16:32 UTC (permalink / raw) To: Ghimiray, Himal Prasad Cc: Joonas Lahtinen, Roper, Matthew D, intel-xe@lists.freedesktop.org On Wed, Jul 19, 2023 at 01:51:37AM +0000, Ghimiray, Himal Prasad wrote: > Hi Matt, > > > -----Original Message----- > > From: Roper, Matthew D <matthew.d.roper@intel.com> > > Sent: 19 July 2023 05:23 > > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com> > > Cc: intel-xe@lists.freedesktop.org; Vivi, Rodrigo <rodrigo.vivi@intel.com> > > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset > > fails > > > > On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote: > > > Send uevent in case of gt reset failure. This intimation can be used > > > by userspace monitoring tool to do the device level reset/reboot when > > > GT reset fails. udevadm can be used to monitor the uevents. > > > > This seems a bit questionable. In theory a GT reset shouldn't fail; if it does it > > either means we've either got a major software bug or a really catastrophic > > hardware failure. Letting the kernel driver just blindly move forward after > > reporting a uevent doesn't seem like a proper response to a huge problem > > like that. Relying on userspace to pick up the pieces seems insufficient. > > > > Right now the return code from gt_reset() is just ignored and no meaningful > > action is taken within the driver. It seems like we should at least be putting > > the device into some kind of 'unusable' state (similar to i915's 'wedged') so > > that other parts of the driver know that everything is completely broken and > > that they should stop trying to use the device (or some subset of the device). > > We could also potentially attempt recovery ourselves by escalating to a > > DriverFLR. > > Agreed. These feature needs to be there in the driver. indeed. one possibility is to remove the drm card but keep the pci up waiting for the reset. Another is to 'wedge' like i915. > > > > > I'm not sure why any userspace would ever care about the specific condition > > of "gt reset failed" --- it seems like it would pretty much only care about the > > device being in one of three states and wouldn't care about the driver- > > internal details about how/why it's in one of > > those: > > > > * Everything is good; can use device as normal > > * Device partially broken; it's no longer possible to use these > > specific GTs, engines, etc., but other units should still function as > > expected. > > * Absolutely everything is broken and nothing works. You _might_ be > > able to unload the driver, issue a PCI FLR, and reload the driver, > > but otherwise all hope is lost and the device cannot be used anymore. > > > > Even if we decide that some uevent is appropriate here, this is new uapi and > > will need a real userspace consumer too. The fact that we can watch any > > uevents raised by the device with udevadm doesn't seem like it would be > > sufficient for that purpose; that doesn't feel much different than claiming > > 'cat' as the userspace consumer of a sysfs node or something. > > What we really need for new uapi is a meaningful top-to-bottom solution > > that gives us confidence that we have an appropriate and maintainable > > interface defined. > > https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED is the consumer for this uapi. > The application can register for events ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start listening to this event. > Then Sysman would send ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application, incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1" or if Sysman detects a repair have occurred. > > Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event application could query zesDeviceGetState to get the reason of RESET. And Sysman would tell the reason of reset as : > ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context and if context creation fails due to EIO error, then device is determined to be wedged and reset reason is declared as WEDGED > > And sysman triggers a SBR to reset the device. yeap, this is already good by itself, even independent of that mode. But we need to think about the possibility of removing the drm card above. Maybe it would be better to add the uevent at the pci level rather then the drm. And maybe with a general device_status= variable. +Joonas > > BR > Himal > > > > > > Matt > > > > > > > > v2: > > > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange variables > > > in Christmas tree order(Tejas) -Check GUC_GSC_OTHER_CLASS(Tejas) > > > > > > v3: > > > - Rebase > > > - Remove notification for engine reset failure (Rodrigo) > > > > > > v4 > > > - Rectify the comments in header file. > > > > > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> > > > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> > > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > > > Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> > > > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com> > > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> > > > --- > > > drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++ > > > include/uapi/drm/xe_drm.h | 8 ++++++++ > > > 2 files changed, 26 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c > > > index a21d44bfe9e8..1db4d610f2fd 100644 > > > --- a/drivers/gpu/drm/xe/xe_gt.c > > > +++ b/drivers/gpu/drm/xe/xe_gt.c > > > @@ -8,6 +8,7 @@ > > > #include <linux/minmax.h> > > > > > > #include <drm/drm_managed.h> > > > +#include <drm/xe_drm.h> > > > > > > #include "regs/xe_gt_regs.h" > > > #include "xe_bb.h" > > > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt) > > > return 0; > > > } > > > > > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) { > > > + char *reset_event[5]; > > > + > > > + reset_event[0] = XE_RESET_FAILED_UEVENT "=1"; > > > + reset_event[1] = "RESET_ENABLED=1"; > > > + reset_event[2] = "RESET_UNIT=gt"; > > > + reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id); > > > + reset_event[4] = NULL; > > > + kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE, > > > +reset_event); > > > + > > > + kfree(reset_event[3]); > > > +} > > > + > > > static int gt_reset(struct xe_gt *gt) { > > > int err; > > > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt) > > > xe_device_mem_access_put(gt_to_xe(gt)); > > > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > > > > > + /* Notify userspace about gt reset failure */ > > > + xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id); > > > + > > > return err; > > > } > > > > > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h > > > index 347351a8f618..c28bb54812e5 100644 > > > --- a/include/uapi/drm/xe_drm.h > > > +++ b/include/uapi/drm/xe_drm.h > > > @@ -16,6 +16,14 @@ extern "C" { > > > * subject to backwards-compatibility constraints. > > > */ > > > > > > +/* > > > + * Uevents generated by xe on it's device node. > > > + * > > > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset > > > + * gt fails. The value supplied with the event is always 1. > > > + */ > > > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED" > > > + > > > /** > > > * struct xe_user_extension - Base class for defining a chain of extensions > > > * > > > -- > > > 2.25.1 > > > > > > > -- > > Matt Roper > > Graphics Software Engineer > > Linux GPU Platform Enablement > > Intel Corporation ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails 2023-07-20 16:32 ` Rodrigo Vivi @ 2023-07-24 8:35 ` Ghimiray, Himal Prasad 0 siblings, 0 replies; 13+ messages in thread From: Ghimiray, Himal Prasad @ 2023-07-24 8:35 UTC (permalink / raw) To: Vivi, Rodrigo Cc: Joonas Lahtinen, Roper, Matthew D, intel-xe@lists.freedesktop.org > -----Original Message----- > From: Vivi, Rodrigo <rodrigo.vivi@intel.com> > Sent: 20 July 2023 22:03 > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com> > Cc: Roper, Matthew D <matthew.d.roper@intel.com>; intel- > xe@lists.freedesktop.org; Joonas Lahtinen > <joonas.lahtinen@linux.intel.com> > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset > fails > > On Wed, Jul 19, 2023 at 01:51:37AM +0000, Ghimiray, Himal Prasad wrote: > > Hi Matt, > > > > > -----Original Message----- > > > From: Roper, Matthew D <matthew.d.roper@intel.com> > > > Sent: 19 July 2023 05:23 > > > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com> > > > Cc: intel-xe@lists.freedesktop.org; Vivi, Rodrigo > > > <rodrigo.vivi@intel.com> > > > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when > > > gt reset fails > > > > > > On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote: > > > > Send uevent in case of gt reset failure. This intimation can be > > > > used by userspace monitoring tool to do the device level > > > > reset/reboot when GT reset fails. udevadm can be used to monitor the > uevents. > > > > > > This seems a bit questionable. In theory a GT reset shouldn't fail; > > > if it does it either means we've either got a major software bug or > > > a really catastrophic hardware failure. Letting the kernel driver > > > just blindly move forward after reporting a uevent doesn't seem like > > > a proper response to a huge problem like that. Relying on userspace to > pick up the pieces seems insufficient. > > > > > > Right now the return code from gt_reset() is just ignored and no > > > meaningful action is taken within the driver. It seems like we > > > should at least be putting the device into some kind of 'unusable' > > > state (similar to i915's 'wedged') so that other parts of the driver > > > know that everything is completely broken and that they should stop > trying to use the device (or some subset of the device). > > > We could also potentially attempt recovery ourselves by escalating > > > to a DriverFLR. > > > > Agreed. These feature needs to be there in the driver. > > indeed. one possibility is to remove the drm card but keep the pci up waiting > for the reset. Another is to 'wedge' like i915. > > > > > > > > > I'm not sure why any userspace would ever care about the specific > > > condition of "gt reset failed" --- it seems like it would pretty > > > much only care about the device being in one of three states and > > > wouldn't care about the driver- internal details about how/why it's > > > in one of > > > those: > > > > > > * Everything is good; can use device as normal > > > * Device partially broken; it's no longer possible to use these > > > specific GTs, engines, etc., but other units should still function as > > > expected. > > > * Absolutely everything is broken and nothing works. You _might_ be > > > able to unload the driver, issue a PCI FLR, and reload the driver, > > > but otherwise all hope is lost and the device cannot be used anymore. > > > > > > Even if we decide that some uevent is appropriate here, this is new > > > uapi and will need a real userspace consumer too. The fact that we > > > can watch any uevents raised by the device with udevadm doesn't seem > > > like it would be sufficient for that purpose; that doesn't feel much > > > different than claiming 'cat' as the userspace consumer of a sysfs node or > something. > > > What we really need for new uapi is a meaningful top-to-bottom > > > solution that gives us confidence that we have an appropriate and > > > maintainable interface defined. > > > > https://spec.oneapi.io/level- > zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RES > ET_REQUIRED is the consumer for this uapi. > > The application can register for events > ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start > listening to this event. > > Then Sysman would send > ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application, > incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1" > or if Sysman detects a repair have occurred. > > > > Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event > application could query zesDeviceGetState to get the reason of RESET. And > Sysman would tell the reason of reset as : > > ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context > and > > if context creation fails due to EIO error, then device is determined > > to be wedged and reset reason is declared as WEDGED > > > > And sysman triggers a SBR to reset the device. > > yeap, this is already good by itself, even independent of that mode. > > But we need to think about the possibility of removing the drm card above. > Maybe it would be better to add the uevent at the pci level rather then the > drm. > And maybe with a general device_status= variable. Sure. Sounds logical. Will be uploading new patch for using pcie kobj instead of drm and will be going ahead with DEVICE_STATUS="NEEDS_RESET" RESET_FAILED=gt_id. Thanks Joonas for the offline inputs. > > +Joonas > > > > > BR > > Himal > > > > > > > > > Matt > > > > > > > > > > > v2: > > > > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange > > > > variables in Christmas tree order(Tejas) -Check > > > > GUC_GSC_OTHER_CLASS(Tejas) > > > > > > > > v3: > > > > - Rebase > > > > - Remove notification for engine reset failure (Rodrigo) > > > > > > > > v4 > > > > - Rectify the comments in header file. > > > > > > > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com> > > > > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> > > > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > > > > Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> > > > > Signed-off-by: Himal Prasad > > > > Ghimiray<himal.prasad.ghimiray@intel.com> > > > > Signed-off-by: Himal Prasad Ghimiray > > > > <himal.prasad.ghimiray@intel.com> > > > > --- > > > > drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++ > > > > include/uapi/drm/xe_drm.h | 8 ++++++++ > > > > 2 files changed, 26 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c > > > > b/drivers/gpu/drm/xe/xe_gt.c index a21d44bfe9e8..1db4d610f2fd > > > > 100644 > > > > --- a/drivers/gpu/drm/xe/xe_gt.c > > > > +++ b/drivers/gpu/drm/xe/xe_gt.c > > > > @@ -8,6 +8,7 @@ > > > > #include <linux/minmax.h> > > > > > > > > #include <drm/drm_managed.h> > > > > +#include <drm/xe_drm.h> > > > > > > > > #include "regs/xe_gt_regs.h" > > > > #include "xe_bb.h" > > > > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt) > > > > return 0; > > > > } > > > > > > > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) { > > > > + char *reset_event[5]; > > > > + > > > > + reset_event[0] = XE_RESET_FAILED_UEVENT "=1"; > > > > + reset_event[1] = "RESET_ENABLED=1"; > > > > + reset_event[2] = "RESET_UNIT=gt"; > > > > + reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id); > > > > + reset_event[4] = NULL; > > > > + kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE, > > > > +reset_event); > > > > + > > > > + kfree(reset_event[3]); > > > > +} > > > > + > > > > static int gt_reset(struct xe_gt *gt) { > > > > int err; > > > > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt) > > > > xe_device_mem_access_put(gt_to_xe(gt)); > > > > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > > > > > > > + /* Notify userspace about gt reset failure */ > > > > + xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id); > > > > + > > > > return err; > > > > } > > > > > > > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h > > > > index 347351a8f618..c28bb54812e5 100644 > > > > --- a/include/uapi/drm/xe_drm.h > > > > +++ b/include/uapi/drm/xe_drm.h > > > > @@ -16,6 +16,14 @@ extern "C" { > > > > * subject to backwards-compatibility constraints. > > > > */ > > > > > > > > +/* > > > > + * Uevents generated by xe on it's device node. > > > > + * > > > > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to > > > > +reset > > > > + * gt fails. The value supplied with the event is always 1. > > > > + */ > > > > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED" > > > > + > > > > /** > > > > * struct xe_user_extension - Base class for defining a chain of > extensions > > > > * > > > > -- > > > > 2.25.1 > > > > > > > > > > -- > > > Matt Roper > > > Graphics Software Engineer > > > Linux GPU Platform Enablement > > > Intel Corporation ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset 2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray @ 2023-07-18 13:32 ` Himal Prasad Ghimiray 2023-07-20 16:28 ` Rodrigo Vivi 2023-07-18 15:52 ` [Intel-xe] ✓ CI.Patch_applied: success for Notify userspace about uevent failure Patchwork ` (2 subsequent siblings) 4 siblings, 1 reply; 13+ messages in thread From: Himal Prasad Ghimiray @ 2023-07-18 13:32 UTC (permalink / raw) To: intel-xe; +Cc: Lucas De Marchi, Rodrigo Vivi To trigger gt reset failure: echo 100 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability echo 2 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> --- drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++ drivers/gpu/drm/xe/xe_gt.c | 8 +++++++- drivers/gpu/drm/xe/xe_gt.h | 14 ++++++++++++++ 3 files changed, 31 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c index 7827a785b020..08d5bdf4cf61 100644 --- a/drivers/gpu/drm/xe/xe_debugfs.c +++ b/drivers/gpu/drm/xe/xe_debugfs.c @@ -5,6 +5,7 @@ #include "xe_debugfs.h" +#include <linux/fault-inject.h> #include <linux/string_helpers.h> #include <drm/drm_debugfs.h> @@ -20,6 +21,10 @@ #include "xe_vm.h" #endif +#ifdef CONFIG_FAULT_INJECTION +DECLARE_FAULT_ATTR(gt_reset_failure); +#endif + static struct xe_device *node_to_xe(struct drm_info_node *node) { return to_xe_device(node->minor->dev); @@ -131,4 +136,9 @@ void xe_debugfs_register(struct xe_device *xe) for_each_gt(gt, xe, id) xe_gt_debugfs_register(gt); + +#ifdef CONFIG_FAULT_INJECTION + fault_create_debugfs_attr("fail_gt_reset", root, >_reset_failure); +#endif + } diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c index 1db4d610f2fd..370d4b96e616 100644 --- a/drivers/gpu/drm/xe/xe_gt.c +++ b/drivers/gpu/drm/xe/xe_gt.c @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt) xe_gt_info(gt, "reset started\n"); + if (xe_fault_inject_gt_reset()) { + err = -ECANCELED; + goto err_fail; + } + xe_gt_sanitize(gt); xe_device_mem_access_get(gt_to_xe(gt)); @@ -562,6 +567,7 @@ static int gt_reset(struct xe_gt *gt) err_msg: XE_WARN_ON(xe_uc_start(>->uc)); xe_device_mem_access_put(gt_to_xe(gt)); +err_fail: xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); /* Notify userspace about gt reset failure */ @@ -582,7 +588,7 @@ void xe_gt_reset_async(struct xe_gt *gt) xe_gt_info(gt, "trying reset\n"); /* Don't do a reset while one is already in flight */ - if (xe_uc_reset_prepare(>->uc)) + if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(>->uc)) return; xe_gt_info(gt, "reset queued\n"); diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h index 7298653a73de..caded203a8a0 100644 --- a/drivers/gpu/drm/xe/xe_gt.h +++ b/drivers/gpu/drm/xe/xe_gt.h @@ -7,6 +7,7 @@ #define _XE_GT_H_ #include <drm/drm_util.h> +#include <linux/fault-inject.h> #include "xe_device_types.h" #include "xe_hw_engine.h" @@ -16,6 +17,19 @@ for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \ xe_hw_engine_is_valid((hwe__))) +#ifdef CONFIG_FAULT_INJECTION +extern struct fault_attr gt_reset_failure; +static inline bool xe_fault_inject_gt_reset(void) +{ + return should_fail(>_reset_failure, 1); +} +#else +static inline bool xe_fault_inject_gt_reset(void) +{ + return false; +} +#endif + struct xe_gt *xe_gt_alloc(struct xe_tile *tile); int xe_gt_init_early(struct xe_gt *gt); int xe_gt_init(struct xe_gt *gt); -- 2.25.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset Himal Prasad Ghimiray @ 2023-07-20 16:28 ` Rodrigo Vivi 2023-07-24 8:38 ` Ghimiray, Himal Prasad 0 siblings, 1 reply; 13+ messages in thread From: Rodrigo Vivi @ 2023-07-20 16:28 UTC (permalink / raw) To: Himal Prasad Ghimiray; +Cc: Lucas De Marchi, intel-xe On Tue, Jul 18, 2023 at 07:02:16PM +0530, Himal Prasad Ghimiray wrote: > To trigger gt reset failure: > echo 100 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability > echo 2 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times why 2 and not 1? anyway, neat solution! Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Cc: Lucas De Marchi <lucas.demarchi@intel.com> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> > --- > drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++ > drivers/gpu/drm/xe/xe_gt.c | 8 +++++++- > drivers/gpu/drm/xe/xe_gt.h | 14 ++++++++++++++ > 3 files changed, 31 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c > index 7827a785b020..08d5bdf4cf61 100644 > --- a/drivers/gpu/drm/xe/xe_debugfs.c > +++ b/drivers/gpu/drm/xe/xe_debugfs.c > @@ -5,6 +5,7 @@ > > #include "xe_debugfs.h" > > +#include <linux/fault-inject.h> > #include <linux/string_helpers.h> > > #include <drm/drm_debugfs.h> > @@ -20,6 +21,10 @@ > #include "xe_vm.h" > #endif > > +#ifdef CONFIG_FAULT_INJECTION > +DECLARE_FAULT_ATTR(gt_reset_failure); > +#endif > + > static struct xe_device *node_to_xe(struct drm_info_node *node) > { > return to_xe_device(node->minor->dev); > @@ -131,4 +136,9 @@ void xe_debugfs_register(struct xe_device *xe) > > for_each_gt(gt, xe, id) > xe_gt_debugfs_register(gt); > + > +#ifdef CONFIG_FAULT_INJECTION > + fault_create_debugfs_attr("fail_gt_reset", root, >_reset_failure); > +#endif > + > } > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c > index 1db4d610f2fd..370d4b96e616 100644 > --- a/drivers/gpu/drm/xe/xe_gt.c > +++ b/drivers/gpu/drm/xe/xe_gt.c > @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt) > > xe_gt_info(gt, "reset started\n"); > > + if (xe_fault_inject_gt_reset()) { > + err = -ECANCELED; > + goto err_fail; > + } > + > xe_gt_sanitize(gt); > > xe_device_mem_access_get(gt_to_xe(gt)); > @@ -562,6 +567,7 @@ static int gt_reset(struct xe_gt *gt) > err_msg: > XE_WARN_ON(xe_uc_start(>->uc)); > xe_device_mem_access_put(gt_to_xe(gt)); > +err_fail: > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > /* Notify userspace about gt reset failure */ > @@ -582,7 +588,7 @@ void xe_gt_reset_async(struct xe_gt *gt) > xe_gt_info(gt, "trying reset\n"); > > /* Don't do a reset while one is already in flight */ > - if (xe_uc_reset_prepare(>->uc)) > + if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(>->uc)) > return; > > xe_gt_info(gt, "reset queued\n"); > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h > index 7298653a73de..caded203a8a0 100644 > --- a/drivers/gpu/drm/xe/xe_gt.h > +++ b/drivers/gpu/drm/xe/xe_gt.h > @@ -7,6 +7,7 @@ > #define _XE_GT_H_ > > #include <drm/drm_util.h> > +#include <linux/fault-inject.h> > > #include "xe_device_types.h" > #include "xe_hw_engine.h" > @@ -16,6 +17,19 @@ > for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \ > xe_hw_engine_is_valid((hwe__))) > > +#ifdef CONFIG_FAULT_INJECTION > +extern struct fault_attr gt_reset_failure; > +static inline bool xe_fault_inject_gt_reset(void) > +{ > + return should_fail(>_reset_failure, 1); > +} > +#else > +static inline bool xe_fault_inject_gt_reset(void) > +{ > + return false; > +} > +#endif > + > struct xe_gt *xe_gt_alloc(struct xe_tile *tile); > int xe_gt_init_early(struct xe_gt *gt); > int xe_gt_init(struct xe_gt *gt); > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset 2023-07-20 16:28 ` Rodrigo Vivi @ 2023-07-24 8:38 ` Ghimiray, Himal Prasad 0 siblings, 0 replies; 13+ messages in thread From: Ghimiray, Himal Prasad @ 2023-07-24 8:38 UTC (permalink / raw) To: Vivi, Rodrigo; +Cc: De Marchi, Lucas, intel-xe@lists.freedesktop.org Hi Rodrigo, > -----Original Message----- > From: Vivi, Rodrigo <rodrigo.vivi@intel.com> > Sent: 20 July 2023 21:59 > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com> > Cc: intel-xe@lists.freedesktop.org; De Marchi, Lucas > <lucas.demarchi@intel.com> > Subject: Re: [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt > reset > > On Tue, Jul 18, 2023 at 07:02:16PM +0530, Himal Prasad Ghimiray wrote: > > To trigger gt reset failure: > > echo 100 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability > > echo 2 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times > > why 2 and not 1? We are relying on 2 times should_fail returning true. 1st to skip the check for xe_uc_reset_prepare(>->uc)). and 2nd to send uevent. > > anyway, neat solution! > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Thanks for the review. BR Himal > > > > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> > > --- > > drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++ > > drivers/gpu/drm/xe/xe_gt.c | 8 +++++++- > > drivers/gpu/drm/xe/xe_gt.h | 14 ++++++++++++++ > > 3 files changed, 31 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c > > b/drivers/gpu/drm/xe/xe_debugfs.c index 7827a785b020..08d5bdf4cf61 > > 100644 > > --- a/drivers/gpu/drm/xe/xe_debugfs.c > > +++ b/drivers/gpu/drm/xe/xe_debugfs.c > > @@ -5,6 +5,7 @@ > > > > #include "xe_debugfs.h" > > > > +#include <linux/fault-inject.h> > > #include <linux/string_helpers.h> > > > > #include <drm/drm_debugfs.h> > > @@ -20,6 +21,10 @@ > > #include "xe_vm.h" > > #endif > > > > +#ifdef CONFIG_FAULT_INJECTION > > +DECLARE_FAULT_ATTR(gt_reset_failure); > > +#endif > > + > > static struct xe_device *node_to_xe(struct drm_info_node *node) { > > return to_xe_device(node->minor->dev); @@ -131,4 +136,9 @@ > void > > xe_debugfs_register(struct xe_device *xe) > > > > for_each_gt(gt, xe, id) > > xe_gt_debugfs_register(gt); > > + > > +#ifdef CONFIG_FAULT_INJECTION > > + fault_create_debugfs_attr("fail_gt_reset", root, >_reset_failure); > > +#endif > > + > > } > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c > > index 1db4d610f2fd..370d4b96e616 100644 > > --- a/drivers/gpu/drm/xe/xe_gt.c > > +++ b/drivers/gpu/drm/xe/xe_gt.c > > @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt) > > > > xe_gt_info(gt, "reset started\n"); > > > > + if (xe_fault_inject_gt_reset()) { > > + err = -ECANCELED; > > + goto err_fail; > > + } > > + > > xe_gt_sanitize(gt); > > > > xe_device_mem_access_get(gt_to_xe(gt)); > > @@ -562,6 +567,7 @@ static int gt_reset(struct xe_gt *gt) > > err_msg: > > XE_WARN_ON(xe_uc_start(>->uc)); > > xe_device_mem_access_put(gt_to_xe(gt)); > > +err_fail: > > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); > > > > /* Notify userspace about gt reset failure */ @@ -582,7 +588,7 @@ > > void xe_gt_reset_async(struct xe_gt *gt) > > xe_gt_info(gt, "trying reset\n"); > > > > /* Don't do a reset while one is already in flight */ > > - if (xe_uc_reset_prepare(>->uc)) > > + if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(>->uc)) > > return; > > > > xe_gt_info(gt, "reset queued\n"); > > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h > > index 7298653a73de..caded203a8a0 100644 > > --- a/drivers/gpu/drm/xe/xe_gt.h > > +++ b/drivers/gpu/drm/xe/xe_gt.h > > @@ -7,6 +7,7 @@ > > #define _XE_GT_H_ > > > > #include <drm/drm_util.h> > > +#include <linux/fault-inject.h> > > > > #include "xe_device_types.h" > > #include "xe_hw_engine.h" > > @@ -16,6 +17,19 @@ > > for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \ > > xe_hw_engine_is_valid((hwe__))) > > > > +#ifdef CONFIG_FAULT_INJECTION > > +extern struct fault_attr gt_reset_failure; static inline bool > > +xe_fault_inject_gt_reset(void) { > > + return should_fail(>_reset_failure, 1); } #else static inline bool > > +xe_fault_inject_gt_reset(void) { > > + return false; > > +} > > +#endif > > + > > struct xe_gt *xe_gt_alloc(struct xe_tile *tile); int > > xe_gt_init_early(struct xe_gt *gt); int xe_gt_init(struct xe_gt *gt); > > -- > > 2.25.1 > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Intel-xe] ✓ CI.Patch_applied: success for Notify userspace about uevent failure 2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset Himal Prasad Ghimiray @ 2023-07-18 15:52 ` Patchwork 2023-07-18 15:52 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork 2023-07-18 15:52 ` [Intel-xe] ✗ CI.KUnit: failure " Patchwork 4 siblings, 0 replies; 13+ messages in thread From: Patchwork @ 2023-07-18 15:52 UTC (permalink / raw) To: Himal Prasad Ghimiray; +Cc: intel-xe == Series Details == Series: Notify userspace about uevent failure URL : https://patchwork.freedesktop.org/series/120919/ State : success == Summary == === Applying kernel patches on branch 'drm-xe-next' with base: === Base commit: a43fd52b7 drm/xe: Cleanup style warnings === git am output follows === Applying: drm/xe: Notify Userspace when gt reset fails Applying: drm/xe: Introduce fault injection for gt reset ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Intel-xe] ✗ CI.checkpatch: warning for Notify userspace about uevent failure 2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray ` (2 preceding siblings ...) 2023-07-18 15:52 ` [Intel-xe] ✓ CI.Patch_applied: success for Notify userspace about uevent failure Patchwork @ 2023-07-18 15:52 ` Patchwork 2023-07-18 15:52 ` [Intel-xe] ✗ CI.KUnit: failure " Patchwork 4 siblings, 0 replies; 13+ messages in thread From: Patchwork @ 2023-07-18 15:52 UTC (permalink / raw) To: Himal Prasad Ghimiray; +Cc: intel-xe == Series Details == Series: Notify userspace about uevent failure URL : https://patchwork.freedesktop.org/series/120919/ State : warning == Summary == + KERNEL=/kernel + git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt Cloning into 'mt'... warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/ + git -C mt rev-list -n1 origin/master c7d32770e3cd31d9fc134ce41f329b10aa33ee15 + cd /kernel + git config --global --add safe.directory /kernel + git log -n1 commit 0165f5d1bb25154886381f0b2c996c37a9ad73a8 Author: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Date: Tue Jul 18 19:02:16 2023 +0530 drm/xe: Introduce fault injection for gt reset To trigger gt reset failure: echo 100 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability echo 2 > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> + /mt/dim checkpatch a43fd52b7ab2009631518ead5ca0afe9bc44515a drm-intel ce6aaa08f drm/xe: Notify Userspace when gt reset fails -:27: WARNING:BAD_SIGN_OFF: Duplicate signature #27: Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> total: 0 errors, 1 warnings, 0 checks, 50 lines checked 0165f5d1b drm/xe: Introduce fault injection for gt reset -:93: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (16, 0) #93: FILE: drivers/gpu/drm/xe/xe_gt.h:17: for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \ [...] +extern struct fault_attr gt_reset_failure; total: 0 errors, 1 warnings, 0 checks, 78 lines checked ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Intel-xe] ✗ CI.KUnit: failure for Notify userspace about uevent failure 2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray ` (3 preceding siblings ...) 2023-07-18 15:52 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork @ 2023-07-18 15:52 ` Patchwork 4 siblings, 0 replies; 13+ messages in thread From: Patchwork @ 2023-07-18 15:52 UTC (permalink / raw) To: Himal Prasad Ghimiray; +Cc: intel-xe == Series Details == Series: Notify userspace about uevent failure URL : https://patchwork.freedesktop.org/series/120919/ State : failure == Summary == + trap cleanup EXIT + /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig ERROR:root:In file included from ../drivers/gpu/drm/xe/xe_debugfs.c:8: ../include/linux/fault-inject.h:94:1: error: unknown type name ‘bool’ 94 | bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order); | ^~~~ ../include/linux/fault-inject.h:94:29: error: unknown type name ‘gfp_t’ 94 | bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order); | ^~~~~ ../include/linux/fault-inject.h:96:43: error: unknown type name ‘gfp_t’ 96 | int should_failslab(struct kmem_cache *s, gfp_t gfpflags); | ^~~~~ ../include/linux/fault-inject.h:100:15: error: unknown type name ‘bool’ 100 | static inline bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) | ^~~~ ../include/linux/fault-inject.h:100:60: error: unknown type name ‘gfp_t’ 100 | static inline bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) | ^~~~~ make[6]: *** [../scripts/Makefile.build:252: drivers/gpu/drm/xe/xe_debugfs.o] Error 1 make[6]: *** Waiting for unfinished jobs.... make[5]: *** [../scripts/Makefile.build:494: drivers/gpu/drm/xe] Error 2 make[4]: *** [../scripts/Makefile.build:494: drivers/gpu/drm] Error 2 make[3]: *** [../scripts/Makefile.build:494: drivers/gpu] Error 2 make[3]: *** Waiting for unfinished jobs.... make[2]: *** [../scripts/Makefile.build:494: drivers] Error 2 make[2]: *** Waiting for unfinished jobs.... make[1]: *** [/kernel/Makefile:2026: .] Error 2 make: *** [Makefile:226: __sub-make] Error 2 [15:52:35] Configuring KUnit Kernel ... Generating .config ... Populating config with: $ make ARCH=um O=.kunit olddefconfig [15:52:40] Building KUnit Kernel ... Populating config with: $ make ARCH=um O=.kunit olddefconfig Building with: $ make ARCH=um O=.kunit --jobs=48 + cleanup ++ stat -c %u:%g /kernel + chown -R 1003:1003 /kernel ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-07-24 8:38 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray 2023-07-18 15:01 ` Nilawar, Badal 2023-07-18 23:52 ` Roper, Matthew D 2023-07-19 1:51 ` Ghimiray, Himal Prasad 2023-07-20 16:32 ` Rodrigo Vivi 2023-07-24 8:35 ` Ghimiray, Himal Prasad 2023-07-18 13:32 ` [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset Himal Prasad Ghimiray 2023-07-20 16:28 ` Rodrigo Vivi 2023-07-24 8:38 ` Ghimiray, Himal Prasad 2023-07-18 15:52 ` [Intel-xe] ✓ CI.Patch_applied: success for Notify userspace about uevent failure Patchwork 2023-07-18 15:52 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork 2023-07-18 15:52 ` [Intel-xe] ✗ CI.KUnit: failure " Patchwork
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.