* [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()
@ 2025-05-22 8:00 Satyanarayana K V P
2025-05-22 8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Satyanarayana K V P @ 2025-05-22 8:00 UTC (permalink / raw)
To: intel-xe
Cc: Satyanarayana K V P, John Harrison, Michal Wajdeczko, Jani Nikula
When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
functions, the CI test systems are going out of space and crashing. To
avoid this issue, a new helper function is created and when fault is
injected into this xe_inject_fault() helper function, ct dead capture
is avoided which suppresses ct dumps in the log.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Suggested-by: John Harrison <John.C.Harrison@Intel.com>
---
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
V4 -> V5:
- Fixed review comments.
V3 -> V4:
- Updated the name of helper function and moved to xe_device.h file.
V2 -> V3:
- Added inline function to avoid compilation error in the absence of
CONFIG_FUNCTION_ERROR_INJECTION.
V1 -> V2:
- Fixed review comments.
---
drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
drivers/gpu/drm/xe/xe_guc_ct.c | 6 ++++++
3 files changed, 38 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 660b0c5126dc..e1b4087a2974 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
xe_pm_runtime_put(xe);
}
+/**
+ * xe_is_injection_active() - Helper function to test whether a fault
+ * inject test is running.
+ *
+ * This is a helper function that the driver can use to detect whether
+ * a fault injection test is running in order to suppress excessive debug
+ * output. By default, the return value is fixed as zero but it can be modified
+ * by the fault injection framework to return an error.
+ *
+ * Returns:
+ * 0 if no fault is injected.
+ * Injected error if fault is injected.
+ */
+int xe_is_injection_active(void)
+{
+ return xe_inject_fault();
+}
+
/**
* xe_device_declare_wedged - Declare device wedged
* @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 0bc3bc8e6803..13a43fe4fb64 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device *xe);
struct xe_file *xe_file_get(struct xe_file *xef);
void xe_file_put(struct xe_file *xef);
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+/*
+ * As fault is injected using this function, need to make sure that
+ * the compiler does not optimize and make it as a inline function.
+ * To prevent compile optimization, "noinline" is added.
+ */
+static noinline int xe_inject_fault(void) { return 0; }
+ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
+#else
+static inline int xe_inject_fault(void) { return 0; }
+#endif
+
+int xe_is_injection_active(void);
+
/*
* Occasionally it is seen that the G2H worker starts running after a delay of more than
* a second even after being queued and activated by the Linux workqueue subsystem. This
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 822f4c33f730..89f992feba31 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reaso
if (ctb)
ctb->info.broken = true;
+ /*
+ * Huge dump is getting generated when injecting error for guc CT/MMIO
+ * functions. So, let us suppress the dump when fault is injected.
+ */
+ if (xe_is_injection_active())
+ return;
/* Ignore further errors after the first dump until a reset */
if (ct->dead.reported)
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread* ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) 2025-05-22 8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P @ 2025-05-22 8:27 ` Patchwork 2025-05-22 8:27 ` ✓ CI.checkpatch: " Patchwork ` (2 subsequent siblings) 3 siblings, 0 replies; 8+ messages in thread From: Patchwork @ 2025-05-22 8:27 UTC (permalink / raw) To: Satyanarayana K V P; +Cc: intel-xe == Series Details == Series: drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) URL : https://patchwork.freedesktop.org/series/148415/ State : success == Summary == === Applying kernel patches on branch 'drm-tip' with base: === Base commit: 09130f929f7c drm-tip: 2025y-05m-22d-06h-26m-11s UTC integration manifest === git am output follows === Applying: drm/xe: Add helper function to inject fault into ct_dead_capture() ^ permalink raw reply [flat|nested] 8+ messages in thread
* ✓ CI.checkpatch: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) 2025-05-22 8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P 2025-05-22 8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork @ 2025-05-22 8:27 ` Patchwork 2025-05-22 8:28 ` ✗ CI.KUnit: failure " Patchwork 2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison 3 siblings, 0 replies; 8+ messages in thread From: Patchwork @ 2025-05-22 8:27 UTC (permalink / raw) To: Satyanarayana K V P; +Cc: intel-xe == Series Details == Series: drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) URL : https://patchwork.freedesktop.org/series/148415/ State : success == Summary == + KERNEL=/kernel + git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt Cloning into 'mt'... warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/ + git -C mt rev-list -n1 origin/master 202708c00696422fd217223bb679a353a5936e23 + cd /kernel + git config --global --add safe.directory /kernel + git log -n1 commit 74f8ca53b9a9ba7e8cee46bb86d0bf0828682bb7 Author: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Date: Thu May 22 08:00:20 2025 +0000 drm/xe: Add helper function to inject fault into ct_dead_capture() When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv() functions, the CI test systems are going out of space and crashing. To avoid this issue, a new helper function is created and when fault is injected into this xe_inject_fault() helper function, ct dead capture is avoided which suppresses ct dumps in the log. Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Suggested-by: John Harrison <John.C.Harrison@Intel.com> + /mt/dim checkpatch 09130f929f7c79440cc8890bcce976843f439f01 drm-intel 74f8ca53b9a9 drm/xe: Add helper function to inject fault into ct_dead_capture() ^ permalink raw reply [flat|nested] 8+ messages in thread
* ✗ CI.KUnit: failure for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) 2025-05-22 8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P 2025-05-22 8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork 2025-05-22 8:27 ` ✓ CI.checkpatch: " Patchwork @ 2025-05-22 8:28 ` Patchwork 2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison 3 siblings, 0 replies; 8+ messages in thread From: Patchwork @ 2025-05-22 8:28 UTC (permalink / raw) To: Satyanarayana K V P; +Cc: intel-xe == Series Details == Series: drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) URL : https://patchwork.freedesktop.org/series/148415/ State : failure == Summary == + trap cleanup EXIT + /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig ERROR:root:../drivers/gpu/drm/drm_bridge.c:1406:6: error: redefinition of ‘devm_drm_put_bridge’ 1406 | void devm_drm_put_bridge(struct device *dev, struct drm_bridge *bridge) | ^~~~~~~~~~~~~~~~~~~ In file included from ../drivers/gpu/drm/drm_bridge.c:31: ../include/drm/drm_bridge.h:1314:20: note: previous definition of ‘devm_drm_put_bridge’ with type ‘void(struct device *, struct drm_bridge *)’ 1314 | static inline void devm_drm_put_bridge(struct device *dev, struct drm_bridge *bridge) {} | ^~~~~~~~~~~~~~~~~~~ make[6]: *** [../scripts/Makefile.build:203: drivers/gpu/drm/drm_bridge.o] Error 1 make[6]: *** Waiting for unfinished jobs.... make[5]: *** [../scripts/Makefile.build:461: drivers/gpu/drm] Error 2 make[4]: *** [../scripts/Makefile.build:461: drivers/gpu] Error 2 make[3]: *** [../scripts/Makefile.build:461: drivers] Error 2 make[2]: *** [/kernel/Makefile:2003: .] Error 2 make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2 make: *** [Makefile:248: __sub-make] Error 2 [08:27:53] Configuring KUnit Kernel ... Generating .config ... Populating config with: $ make ARCH=um O=.kunit olddefconfig [08:27:57] Building KUnit Kernel ... Populating config with: $ make ARCH=um O=.kunit olddefconfig Building with: $ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48 + cleanup ++ stat -c %u:%g /kernel + chown -R 1003:1003 /kernel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() 2025-05-22 8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P ` (2 preceding siblings ...) 2025-05-22 8:28 ` ✗ CI.KUnit: failure " Patchwork @ 2025-05-22 19:38 ` John Harrison 2025-05-23 5:33 ` K V P, Satyanarayana 3 siblings, 1 reply; 8+ messages in thread From: John Harrison @ 2025-05-22 19:38 UTC (permalink / raw) To: Satyanarayana K V P, intel-xe; +Cc: Michal Wajdeczko, Jani Nikula On 5/22/2025 1:00 AM, Satyanarayana K V P wrote: > When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv() > functions, the CI test systems are going out of space and crashing. To > avoid this issue, a new helper function is created and when fault is > injected into this xe_inject_fault() helper function, ct dead capture > is avoided which suppresses ct dumps in the log. > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> > Suggested-by: John Harrison <John.C.Harrison@Intel.com> > --- > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> > Cc: Jani Nikula <jani.nikula@intel.com> > > V4 -> V5: > - Fixed review comments. > > V3 -> V4: > - Updated the name of helper function and moved to xe_device.h file. > > V2 -> V3: > - Added inline function to avoid compilation error in the absence of > CONFIG_FUNCTION_ERROR_INJECTION. > > V1 -> V2: > - Fixed review comments. > --- > drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++ > drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ > drivers/gpu/drm/xe/xe_guc_ct.c | 6 ++++++ > 3 files changed, 38 insertions(+) > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c > index 660b0c5126dc..e1b4087a2974 100644 > --- a/drivers/gpu/drm/xe/xe_device.c > +++ b/drivers/gpu/drm/xe/xe_device.c > @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg) > xe_pm_runtime_put(xe); > } > > +/** > + * xe_is_injection_active() - Helper function to test whether a fault > + * inject test is running. > + * > + * This is a helper function that the driver can use to detect whether > + * a fault injection test is running in order to suppress excessive debug > + * output. By default, the return value is fixed as zero but it can be modified > + * by the fault injection framework to return an error. > + * > + * Returns: > + * 0 if no fault is injected. > + * Injected error if fault is injected. > + */ > +int xe_is_injection_active(void) > +{ > + return xe_inject_fault(); > +} > + I'm confused. What is the point of this wrapper? > /** > * xe_device_declare_wedged - Declare device wedged > * @xe: xe device instance > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h > index 0bc3bc8e6803..13a43fe4fb64 100644 > --- a/drivers/gpu/drm/xe/xe_device.h > +++ b/drivers/gpu/drm/xe/xe_device.h > @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device *xe); > struct xe_file *xe_file_get(struct xe_file *xef); > void xe_file_put(struct xe_file *xef); > > +#ifdef CONFIG_FUNCTION_ERROR_INJECTION > +/* > + * As fault is injected using this function, need to make sure that > + * the compiler does not optimize and make it as a inline function. > + * To prevent compile optimization, "noinline" is added. > + */ > +static noinline int xe_inject_fault(void) { return 0; } > +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO); > +#else > +static inline int xe_inject_fault(void) { return 0; } Why not name this xe_is_injection_active() and just call it directly? John. > +#endif > + > +int xe_is_injection_active(void); > + > /* > * Occasionally it is seen that the G2H worker starts running after a delay of more than > * a second even after being queued and activated by the Linux workqueue subsystem. This > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c > index 822f4c33f730..89f992feba31 100644 > --- a/drivers/gpu/drm/xe/xe_guc_ct.c > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c > @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reaso > > if (ctb) > ctb->info.broken = true; > + /* > + * Huge dump is getting generated when injecting error for guc CT/MMIO > + * functions. So, let us suppress the dump when fault is injected. > + */ > + if (xe_is_injection_active()) > + return; > > /* Ignore further errors after the first dump until a reset */ > if (ct->dead.reported) ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() 2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison @ 2025-05-23 5:33 ` K V P, Satyanarayana 2025-05-23 16:51 ` John Harrison 0 siblings, 1 reply; 8+ messages in thread From: K V P, Satyanarayana @ 2025-05-23 5:33 UTC (permalink / raw) To: Harrison, John C, intel-xe@lists.freedesktop.org Cc: Wajdeczko, Michal, Nikula, Jani, K V P, Satyanarayana Hi > -----Original Message----- > From: Harrison, John C <john.c.harrison@intel.com> > Sent: Friday, May 23, 2025 1:09 AM > To: K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>; intel- > xe@lists.freedesktop.org > Cc: Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Nikula, Jani > <jani.nikula@intel.com> > Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into > ct_dead_capture() > > On 5/22/2025 1:00 AM, Satyanarayana K V P wrote: > > When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv() > > functions, the CI test systems are going out of space and crashing. To > > avoid this issue, a new helper function is created and when fault is > > injected into this xe_inject_fault() helper function, ct dead capture > > is avoided which suppresses ct dumps in the log. > > > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> > > Suggested-by: John Harrison <John.C.Harrison@Intel.com> > > --- > > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> > > Cc: Jani Nikula <jani.nikula@intel.com> > > > > V4 -> V5: > > - Fixed review comments. > > > > V3 -> V4: > > - Updated the name of helper function and moved to xe_device.h file. > > > > V2 -> V3: > > - Added inline function to avoid compilation error in the absence of > > CONFIG_FUNCTION_ERROR_INJECTION. > > > > V1 -> V2: > > - Fixed review comments. > > --- > > drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++ > > drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ > > drivers/gpu/drm/xe/xe_guc_ct.c | 6 ++++++ > > 3 files changed, 38 insertions(+) > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c > b/drivers/gpu/drm/xe/xe_device.c > > index 660b0c5126dc..e1b4087a2974 100644 > > --- a/drivers/gpu/drm/xe/xe_device.c > > +++ b/drivers/gpu/drm/xe/xe_device.c > > @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct > drm_device *drm, void *arg) > > xe_pm_runtime_put(xe); > > } > > > > +/** > > + * xe_is_injection_active() - Helper function to test whether a fault > > + * inject test is running. > > + * > > + * This is a helper function that the driver can use to detect whether > > + * a fault injection test is running in order to suppress excessive debug > > + * output. By default, the return value is fixed as zero but it can be modified > > + * by the fault injection framework to return an error. > > + * > > + * Returns: > > + * 0 if no fault is injected. > > + * Injected error if fault is injected. > > + */ > > +int xe_is_injection_active(void) > > +{ > > + return xe_inject_fault(); > > +} > > + > I'm confused. What is the point of this wrapper? We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from xe_is_injection_active() function. This is to make function names meaningful. > > > /** > > * xe_device_declare_wedged - Declare device wedged > > * @xe: xe device instance > > diff --git a/drivers/gpu/drm/xe/xe_device.h > b/drivers/gpu/drm/xe/xe_device.h > > index 0bc3bc8e6803..13a43fe4fb64 100644 > > --- a/drivers/gpu/drm/xe/xe_device.h > > +++ b/drivers/gpu/drm/xe/xe_device.h > > @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device > *xe); > > struct xe_file *xe_file_get(struct xe_file *xef); > > void xe_file_put(struct xe_file *xef); > > > > +#ifdef CONFIG_FUNCTION_ERROR_INJECTION > > +/* > > + * As fault is injected using this function, need to make sure that > > + * the compiler does not optimize and make it as a inline function. > > + * To prevent compile optimization, "noinline" is added. > > + */ > > +static noinline int xe_inject_fault(void) { return 0; } > > +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO); > > +#else > > +static inline int xe_inject_fault(void) { return 0; } > Why not name this xe_is_injection_active() and just call it directly? > > John. > > > +#endif > > + > > +int xe_is_injection_active(void); > > + > > /* > > * Occasionally it is seen that the G2H worker starts running after a delay of > more than > > * a second even after being queued and activated by the Linux workqueue > subsystem. This > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c > b/drivers/gpu/drm/xe/xe_guc_ct.c > > index 822f4c33f730..89f992feba31 100644 > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c > > @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct > *ct, struct guc_ctb *ctb, u32 reaso > > > > if (ctb) > > ctb->info.broken = true; > > + /* > > + * Huge dump is getting generated when injecting error for guc > CT/MMIO > > + * functions. So, let us suppress the dump when fault is injected. > > + */ > > + if (xe_is_injection_active()) > > + return; > > > > /* Ignore further errors after the first dump until a reset */ > > if (ct->dead.reported) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() 2025-05-23 5:33 ` K V P, Satyanarayana @ 2025-05-23 16:51 ` John Harrison 2025-05-26 11:20 ` Jani Nikula 0 siblings, 1 reply; 8+ messages in thread From: John Harrison @ 2025-05-23 16:51 UTC (permalink / raw) To: K V P, Satyanarayana, intel-xe@lists.freedesktop.org Cc: Wajdeczko, Michal, Nikula, Jani On 5/22/2025 10:33 PM, K V P, Satyanarayana wrote: > Hi >> -----Original Message----- >> From: Harrison, John C <john.c.harrison@intel.com> >> Sent: Friday, May 23, 2025 1:09 AM >> To: K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>; intel- >> xe@lists.freedesktop.org >> Cc: Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Nikula, Jani >> <jani.nikula@intel.com> >> Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into >> ct_dead_capture() >> >> On 5/22/2025 1:00 AM, Satyanarayana K V P wrote: >>> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv() >>> functions, the CI test systems are going out of space and crashing. To >>> avoid this issue, a new helper function is created and when fault is >>> injected into this xe_inject_fault() helper function, ct dead capture >>> is avoided which suppresses ct dumps in the log. >>> >>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> >>> Suggested-by: John Harrison <John.C.Harrison@Intel.com> >>> --- >>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> >>> Cc: Jani Nikula <jani.nikula@intel.com> >>> >>> V4 -> V5: >>> - Fixed review comments. >>> >>> V3 -> V4: >>> - Updated the name of helper function and moved to xe_device.h file. >>> >>> V2 -> V3: >>> - Added inline function to avoid compilation error in the absence of >>> CONFIG_FUNCTION_ERROR_INJECTION. >>> >>> V1 -> V2: >>> - Fixed review comments. >>> --- >>> drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++ >>> drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ >>> drivers/gpu/drm/xe/xe_guc_ct.c | 6 ++++++ >>> 3 files changed, 38 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/xe/xe_device.c >> b/drivers/gpu/drm/xe/xe_device.c >>> index 660b0c5126dc..e1b4087a2974 100644 >>> --- a/drivers/gpu/drm/xe/xe_device.c >>> +++ b/drivers/gpu/drm/xe/xe_device.c >>> @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct >> drm_device *drm, void *arg) >>> xe_pm_runtime_put(xe); >>> } >>> >>> +/** >>> + * xe_is_injection_active() - Helper function to test whether a fault >>> + * inject test is running. >>> + * >>> + * This is a helper function that the driver can use to detect whether >>> + * a fault injection test is running in order to suppress excessive debug >>> + * output. By default, the return value is fixed as zero but it can be modified >>> + * by the fault injection framework to return an error. >>> + * >>> + * Returns: >>> + * 0 if no fault is injected. >>> + * Injected error if fault is injected. >>> + */ >>> +int xe_is_injection_active(void) >>> +{ >>> + return xe_inject_fault(); >>> +} >>> + >> I'm confused. What is the point of this wrapper? > We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from > xe_is_injection_active() function. This is to make function names meaningful. But why? Why not just have a single function that is named "xe_is_injection_active" and has the fault injected into it directly? It has a meaningful name, it does what it needs to do, it keeps things simple. Why have an extra level of function wrapper for no benefit? And in terms of meaningful names, "xe_inject_fault" sounds like a function which is going to actually inject a fault somewhere. Whereas, you are wanting something more like "xe_stub_function_for_injecting_fault_into". But as above, why bother? The comment tells you exactly what is happening here. There is no need for extra levels of wrapping just to change the name of a function. John. >>> /** >>> * xe_device_declare_wedged - Declare device wedged >>> * @xe: xe device instance >>> diff --git a/drivers/gpu/drm/xe/xe_device.h >> b/drivers/gpu/drm/xe/xe_device.h >>> index 0bc3bc8e6803..13a43fe4fb64 100644 >>> --- a/drivers/gpu/drm/xe/xe_device.h >>> +++ b/drivers/gpu/drm/xe/xe_device.h >>> @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device >> *xe); >>> struct xe_file *xe_file_get(struct xe_file *xef); >>> void xe_file_put(struct xe_file *xef); >>> >>> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION >>> +/* >>> + * As fault is injected using this function, need to make sure that >>> + * the compiler does not optimize and make it as a inline function. >>> + * To prevent compile optimization, "noinline" is added. >>> + */ >>> +static noinline int xe_inject_fault(void) { return 0; } >>> +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO); >>> +#else >>> +static inline int xe_inject_fault(void) { return 0; } >> Why not name this xe_is_injection_active() and just call it directly? >> >> John. >> >>> +#endif >>> + >>> +int xe_is_injection_active(void); >>> + >>> /* >>> * Occasionally it is seen that the G2H worker starts running after a delay of >> more than >>> * a second even after being queued and activated by the Linux workqueue >> subsystem. This >>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c >> b/drivers/gpu/drm/xe/xe_guc_ct.c >>> index 822f4c33f730..89f992feba31 100644 >>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c >>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c >>> @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct >> *ct, struct guc_ctb *ctb, u32 reaso >>> if (ctb) >>> ctb->info.broken = true; >>> + /* >>> + * Huge dump is getting generated when injecting error for guc >> CT/MMIO >>> + * functions. So, let us suppress the dump when fault is injected. >>> + */ >>> + if (xe_is_injection_active()) >>> + return; >>> >>> /* Ignore further errors after the first dump until a reset */ >>> if (ct->dead.reported) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() 2025-05-23 16:51 ` John Harrison @ 2025-05-26 11:20 ` Jani Nikula 0 siblings, 0 replies; 8+ messages in thread From: Jani Nikula @ 2025-05-26 11:20 UTC (permalink / raw) To: John Harrison, K V P, Satyanarayana, intel-xe@lists.freedesktop.org Cc: Wajdeczko, Michal On Fri, 23 May 2025, John Harrison <john.c.harrison@intel.com> wrote: > On 5/22/2025 10:33 PM, K V P, Satyanarayana wrote: >> Hi >>> -----Original Message----- >>> From: Harrison, John C <john.c.harrison@intel.com> >>> Sent: Friday, May 23, 2025 1:09 AM >>> To: K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>; intel- >>> xe@lists.freedesktop.org >>> Cc: Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Nikula, Jani >>> <jani.nikula@intel.com> >>> Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into >>> ct_dead_capture() >>> >>> On 5/22/2025 1:00 AM, Satyanarayana K V P wrote: >>>> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv() >>>> functions, the CI test systems are going out of space and crashing. To >>>> avoid this issue, a new helper function is created and when fault is >>>> injected into this xe_inject_fault() helper function, ct dead capture >>>> is avoided which suppresses ct dumps in the log. >>>> >>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> >>>> Suggested-by: John Harrison <John.C.Harrison@Intel.com> >>>> --- >>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> >>>> Cc: Jani Nikula <jani.nikula@intel.com> >>>> >>>> V4 -> V5: >>>> - Fixed review comments. >>>> >>>> V3 -> V4: >>>> - Updated the name of helper function and moved to xe_device.h file. >>>> >>>> V2 -> V3: >>>> - Added inline function to avoid compilation error in the absence of >>>> CONFIG_FUNCTION_ERROR_INJECTION. >>>> >>>> V1 -> V2: >>>> - Fixed review comments. >>>> --- >>>> drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++ >>>> drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ >>>> drivers/gpu/drm/xe/xe_guc_ct.c | 6 ++++++ >>>> 3 files changed, 38 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/xe/xe_device.c >>> b/drivers/gpu/drm/xe/xe_device.c >>>> index 660b0c5126dc..e1b4087a2974 100644 >>>> --- a/drivers/gpu/drm/xe/xe_device.c >>>> +++ b/drivers/gpu/drm/xe/xe_device.c >>>> @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct >>> drm_device *drm, void *arg) >>>> xe_pm_runtime_put(xe); >>>> } >>>> >>>> +/** >>>> + * xe_is_injection_active() - Helper function to test whether a fault >>>> + * inject test is running. >>>> + * >>>> + * This is a helper function that the driver can use to detect whether >>>> + * a fault injection test is running in order to suppress excessive debug >>>> + * output. By default, the return value is fixed as zero but it can be modified >>>> + * by the fault injection framework to return an error. >>>> + * >>>> + * Returns: >>>> + * 0 if no fault is injected. >>>> + * Injected error if fault is injected. >>>> + */ >>>> +int xe_is_injection_active(void) >>>> +{ >>>> + return xe_inject_fault(); >>>> +} >>>> + >>> I'm confused. What is the point of this wrapper? >> We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from >> xe_is_injection_active() function. This is to make function names meaningful. > But why? > > Why not just have a single function that is named > "xe_is_injection_active" and has the fault injected into it directly? It > has a meaningful name, it does what it needs to do, it keeps things > simple. Why have an extra level of function wrapper for no benefit? > > And in terms of meaningful names, "xe_inject_fault" sounds like a > function which is going to actually inject a fault somewhere. Whereas, > you are wanting something more like > "xe_stub_function_for_injecting_fault_into". But as above, why bother? > The comment tells you exactly what is happening here. There is no need > for extra levels of wrapping just to change the name of a function. I think all of this begs the question, does this belong in the kernel at all? The problem is all in userspace ("CI test systems are going out of space and crashing"). Userspace asks for error injection. Why does the kernel need to apply the policy of not logging something specific in this case? What else should the kernel not log, and who's going to decide? Personally I think logging core dumps in dmesg even in CI setting is questionable, but let's just assume it's needed. Then there needs to be a (debugfs?) knob for userspace to enable/disable that behaviour. The fault injection test is very specific. Have the test disable dmesg coredumps during fault injection tests. That's all. Don't make the kernel more complicated and ridden with special casing for userspace problems. BR, Jani. > > John. > >>>> /** >>>> * xe_device_declare_wedged - Declare device wedged >>>> * @xe: xe device instance >>>> diff --git a/drivers/gpu/drm/xe/xe_device.h >>> b/drivers/gpu/drm/xe/xe_device.h >>>> index 0bc3bc8e6803..13a43fe4fb64 100644 >>>> --- a/drivers/gpu/drm/xe/xe_device.h >>>> +++ b/drivers/gpu/drm/xe/xe_device.h >>>> @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device >>> *xe); >>>> struct xe_file *xe_file_get(struct xe_file *xef); >>>> void xe_file_put(struct xe_file *xef); >>>> >>>> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION >>>> +/* >>>> + * As fault is injected using this function, need to make sure that >>>> + * the compiler does not optimize and make it as a inline function. >>>> + * To prevent compile optimization, "noinline" is added. >>>> + */ >>>> +static noinline int xe_inject_fault(void) { return 0; } >>>> +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO); >>>> +#else >>>> +static inline int xe_inject_fault(void) { return 0; } >>> Why not name this xe_is_injection_active() and just call it directly? >>> >>> John. >>> >>>> +#endif >>>> + >>>> +int xe_is_injection_active(void); >>>> + >>>> /* >>>> * Occasionally it is seen that the G2H worker starts running after a delay of >>> more than >>>> * a second even after being queued and activated by the Linux workqueue >>> subsystem. This >>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c >>> b/drivers/gpu/drm/xe/xe_guc_ct.c >>>> index 822f4c33f730..89f992feba31 100644 >>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c >>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c >>>> @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct >>> *ct, struct guc_ctb *ctb, u32 reaso >>>> if (ctb) >>>> ctb->info.broken = true; >>>> + /* >>>> + * Huge dump is getting generated when injecting error for guc >>> CT/MMIO >>>> + * functions. So, let us suppress the dump when fault is injected. >>>> + */ >>>> + if (xe_is_injection_active()) >>>> + return; >>>> >>>> /* Ignore further errors after the first dump until a reset */ >>>> if (ct->dead.reported) > -- Jani Nikula, Intel ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-05-26 11:20 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-05-22 8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P 2025-05-22 8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork 2025-05-22 8:27 ` ✓ CI.checkpatch: " Patchwork 2025-05-22 8:28 ` ✗ CI.KUnit: failure " Patchwork 2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison 2025-05-23 5:33 ` K V P, Satyanarayana 2025-05-23 16:51 ` John Harrison 2025-05-26 11:20 ` Jani Nikula
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox