[PATCH v5] drm/xe: Add helper function to inject fault into ct_dead

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()
@ 2025-05-22  8:00 Satyanarayana K V P
  2025-05-22  8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Satyanarayana K V P @ 2025-05-22  8:00 UTC (permalink / raw)
  To: intel-xe
  Cc: Satyanarayana K V P, John Harrison, Michal Wajdeczko, Jani Nikula

When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
functions, the CI test systems are going out of space and crashing. To
avoid this issue, a new helper function is created and when fault is
injected into this xe_inject_fault() helper function, ct dead capture
is avoided which suppresses ct dumps in the log.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Suggested-by: John Harrison <John.C.Harrison@Intel.com>
---
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>

V4 -> V5:
- Fixed review comments.

V3 -> V4:
- Updated the name of helper function and moved to xe_device.h file.

V2 -> V3:
- Added inline function to avoid compilation error in the absence of
CONFIG_FUNCTION_ERROR_INJECTION.

V1 -> V2:
- Fixed review comments.
---
 drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
 drivers/gpu/drm/xe/xe_guc_ct.c |  6 ++++++
 3 files changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 660b0c5126dc..e1b4087a2974 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
 	xe_pm_runtime_put(xe);
 }
 
+/**
+ * xe_is_injection_active() - Helper function to test whether a fault
+ * inject test is running.
+ *
+ * This is a helper function that the driver can use to detect whether
+ * a fault injection test is running in order to suppress excessive debug
+ * output. By default, the return value is fixed as zero but it can be modified
+ * by the fault injection framework to return an error.
+ *
+ * Returns:
+ * 0 if no fault is injected.
+ * Injected error if fault is injected.
+ */
+int xe_is_injection_active(void)
+{
+	return xe_inject_fault();
+}
+
 /**
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 0bc3bc8e6803..13a43fe4fb64 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device *xe);
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);
 
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+/*
+ * As fault is injected using this function, need to make sure that
+ * the compiler does not optimize and make it as a inline function.
+ * To prevent compile optimization, "noinline" is added.
+ */
+static noinline int xe_inject_fault(void) { return 0; }
+ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
+#else
+static inline int xe_inject_fault(void) { return 0; }
+#endif
+
+int xe_is_injection_active(void);
+
 /*
  * Occasionally it is seen that the G2H worker starts running after a delay of more than
  * a second even after being queued and activated by the Linux workqueue subsystem. This
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 822f4c33f730..89f992feba31 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reaso
 
 	if (ctb)
 		ctb->info.broken = true;
+	/*
+	 * Huge dump is getting generated when injecting error for guc CT/MMIO
+	 * functions. So, let us suppress the dump when fault is injected.
+	 */
+	if (xe_is_injection_active())
+		return;
 
 	/* Ignore further errors after the first dump until a reset */
 	if (ct->dead.reported)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5)
  2025-05-22  8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P
@ 2025-05-22  8:27 ` Patchwork
  2025-05-22  8:27 ` ✓ CI.checkpatch: " Patchwork
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Satyanarayana K V P; +Cc: intel-xe

== Series Details ==

Series: drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5)
URL   : https://patchwork.freedesktop.org/series/148415/
State : success

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 09130f929f7c drm-tip: 2025y-05m-22d-06h-26m-11s UTC integration manifest
=== git am output follows ===
Applying: drm/xe: Add helper function to inject fault into ct_dead_capture()



^ permalink raw reply	[flat|nested] 8+ messages in thread

* ✓ CI.checkpatch: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5)
  2025-05-22  8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P
  2025-05-22  8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork
@ 2025-05-22  8:27 ` Patchwork
  2025-05-22  8:28 ` ✗ CI.KUnit: failure " Patchwork
  2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison
  3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Satyanarayana K V P; +Cc: intel-xe

== Series Details ==

Series: drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5)
URL   : https://patchwork.freedesktop.org/series/148415/
State : success

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
202708c00696422fd217223bb679a353a5936e23
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 74f8ca53b9a9ba7e8cee46bb86d0bf0828682bb7
Author: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Date:   Thu May 22 08:00:20 2025 +0000

    drm/xe: Add helper function to inject fault into ct_dead_capture()
    
    When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
    functions, the CI test systems are going out of space and crashing. To
    avoid this issue, a new helper function is created and when fault is
    injected into this xe_inject_fault() helper function, ct dead capture
    is avoided which suppresses ct dumps in the log.
    
    Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
    Suggested-by: John Harrison <John.C.Harrison@Intel.com>
+ /mt/dim checkpatch 09130f929f7c79440cc8890bcce976843f439f01 drm-intel
74f8ca53b9a9 drm/xe: Add helper function to inject fault into ct_dead_capture()



^ permalink raw reply	[flat|nested] 8+ messages in thread

* ✗ CI.KUnit: failure for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5)
  2025-05-22  8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P
  2025-05-22  8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork
  2025-05-22  8:27 ` ✓ CI.checkpatch: " Patchwork
@ 2025-05-22  8:28 ` Patchwork
  2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison
  3 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2025-05-22  8:28 UTC (permalink / raw)
  To: Satyanarayana K V P; +Cc: intel-xe

== Series Details ==

Series: drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5)
URL   : https://patchwork.freedesktop.org/series/148415/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../drivers/gpu/drm/drm_bridge.c:1406:6: error: redefinition of ‘devm_drm_put_bridge’
 1406 | void devm_drm_put_bridge(struct device *dev, struct drm_bridge *bridge)
      |      ^~~~~~~~~~~~~~~~~~~
In file included from ../drivers/gpu/drm/drm_bridge.c:31:
../include/drm/drm_bridge.h:1314:20: note: previous definition of ‘devm_drm_put_bridge’ with type ‘void(struct device *, struct drm_bridge *)’
 1314 | static inline void devm_drm_put_bridge(struct device *dev, struct drm_bridge *bridge) {}
      |                    ^~~~~~~~~~~~~~~~~~~
make[6]: *** [../scripts/Makefile.build:203: drivers/gpu/drm/drm_bridge.o] Error 1
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [../scripts/Makefile.build:461: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:461: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:461: drivers] Error 2
make[2]: *** [/kernel/Makefile:2003: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2

[08:27:53] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[08:27:57] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()
  2025-05-22  8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P
                   ` (2 preceding siblings ...)
  2025-05-22  8:28 ` ✗ CI.KUnit: failure " Patchwork
@ 2025-05-22 19:38 ` John Harrison
  2025-05-23  5:33   ` K V P, Satyanarayana
  3 siblings, 1 reply; 8+ messages in thread
From: John Harrison @ 2025-05-22 19:38 UTC (permalink / raw)
  To: Satyanarayana K V P, intel-xe; +Cc: Michal Wajdeczko, Jani Nikula

On 5/22/2025 1:00 AM, Satyanarayana K V P wrote:
> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
> functions, the CI test systems are going out of space and crashing. To
> avoid this issue, a new helper function is created and when fault is
> injected into this xe_inject_fault() helper function, ct dead capture
> is avoided which suppresses ct dumps in the log.
>
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
> ---
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Jani Nikula <jani.nikula@intel.com>
>
> V4 -> V5:
> - Fixed review comments.
>
> V3 -> V4:
> - Updated the name of helper function and moved to xe_device.h file.
>
> V2 -> V3:
> - Added inline function to avoid compilation error in the absence of
> CONFIG_FUNCTION_ERROR_INJECTION.
>
> V1 -> V2:
> - Fixed review comments.
> ---
>   drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
>   drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
>   drivers/gpu/drm/xe/xe_guc_ct.c |  6 ++++++
>   3 files changed, 38 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 660b0c5126dc..e1b4087a2974 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   	xe_pm_runtime_put(xe);
>   }
>   
> +/**
> + * xe_is_injection_active() - Helper function to test whether a fault
> + * inject test is running.
> + *
> + * This is a helper function that the driver can use to detect whether
> + * a fault injection test is running in order to suppress excessive debug
> + * output. By default, the return value is fixed as zero but it can be modified
> + * by the fault injection framework to return an error.
> + *
> + * Returns:
> + * 0 if no fault is injected.
> + * Injected error if fault is injected.
> + */
> +int xe_is_injection_active(void)
> +{
> +	return xe_inject_fault();
> +}
> +
I'm confused. What is the point of this wrapper?

>   /**
>    * xe_device_declare_wedged - Declare device wedged
>    * @xe: xe device instance
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0bc3bc8e6803..13a43fe4fb64 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device *xe);
>   struct xe_file *xe_file_get(struct xe_file *xef);
>   void xe_file_put(struct xe_file *xef);
>   
> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
> +/*
> + * As fault is injected using this function, need to make sure that
> + * the compiler does not optimize and make it as a inline function.
> + * To prevent compile optimization, "noinline" is added.
> + */
> +static noinline int xe_inject_fault(void) { return 0; }
> +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
> +#else
> +static inline int xe_inject_fault(void) { return 0; }
Why not name this xe_is_injection_active() and just call it directly?

John.

> +#endif
> +
> +int xe_is_injection_active(void);
> +
>   /*
>    * Occasionally it is seen that the G2H worker starts running after a delay of more than
>    * a second even after being queued and activated by the Linux workqueue subsystem. This
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 822f4c33f730..89f992feba31 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct *ct, struct guc_ctb *ctb, u32 reaso
>   
>   	if (ctb)
>   		ctb->info.broken = true;
> +	/*
> +	 * Huge dump is getting generated when injecting error for guc CT/MMIO
> +	 * functions. So, let us suppress the dump when fault is injected.
> +	 */
> +	if (xe_is_injection_active())
> +		return;
>   
>   	/* Ignore further errors after the first dump until a reset */
>   	if (ct->dead.reported)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()
  2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison
@ 2025-05-23  5:33   ` K V P, Satyanarayana
  2025-05-23 16:51     ` John Harrison
  0 siblings, 1 reply; 8+ messages in thread
From: K V P, Satyanarayana @ 2025-05-23  5:33 UTC (permalink / raw)
  To: Harrison, John C, intel-xe@lists.freedesktop.org
  Cc: Wajdeczko, Michal, Nikula, Jani, K V P, Satyanarayana

Hi
> -----Original Message-----
> From: Harrison, John C <john.c.harrison@intel.com>
> Sent: Friday, May 23, 2025 1:09 AM
> To: K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Nikula, Jani
> <jani.nikula@intel.com>
> Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into
> ct_dead_capture()
> 
> On 5/22/2025 1:00 AM, Satyanarayana K V P wrote:
> > When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
> > functions, the CI test systems are going out of space and crashing. To
> > avoid this issue, a new helper function is created and when fault is
> > injected into this xe_inject_fault() helper function, ct dead capture
> > is avoided which suppresses ct dumps in the log.
> >
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > Suggested-by: John Harrison <John.C.Harrison@Intel.com>
> > ---
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Jani Nikula <jani.nikula@intel.com>
> >
> > V4 -> V5:
> > - Fixed review comments.
> >
> > V3 -> V4:
> > - Updated the name of helper function and moved to xe_device.h file.
> >
> > V2 -> V3:
> > - Added inline function to avoid compilation error in the absence of
> > CONFIG_FUNCTION_ERROR_INJECTION.
> >
> > V1 -> V2:
> > - Fixed review comments.
> > ---
> >   drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
> >   drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
> >   drivers/gpu/drm/xe/xe_guc_ct.c |  6 ++++++
> >   3 files changed, 38 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> > index 660b0c5126dc..e1b4087a2974 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct
> drm_device *drm, void *arg)
> >   	xe_pm_runtime_put(xe);
> >   }
> >
> > +/**
> > + * xe_is_injection_active() - Helper function to test whether a fault
> > + * inject test is running.
> > + *
> > + * This is a helper function that the driver can use to detect whether
> > + * a fault injection test is running in order to suppress excessive debug
> > + * output. By default, the return value is fixed as zero but it can be modified
> > + * by the fault injection framework to return an error.
> > + *
> > + * Returns:
> > + * 0 if no fault is injected.
> > + * Injected error if fault is injected.
> > + */
> > +int xe_is_injection_active(void)
> > +{
> > +	return xe_inject_fault();
> > +}
> > +
> I'm confused. What is the point of this wrapper?
We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from
xe_is_injection_active() function. This is to make function names meaningful.
> 
> >   /**
> >    * xe_device_declare_wedged - Declare device wedged
> >    * @xe: xe device instance
> > diff --git a/drivers/gpu/drm/xe/xe_device.h
> b/drivers/gpu/drm/xe/xe_device.h
> > index 0bc3bc8e6803..13a43fe4fb64 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device
> *xe);
> >   struct xe_file *xe_file_get(struct xe_file *xef);
> >   void xe_file_put(struct xe_file *xef);
> >
> > +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
> > +/*
> > + * As fault is injected using this function, need to make sure that
> > + * the compiler does not optimize and make it as a inline function.
> > + * To prevent compile optimization, "noinline" is added.
> > + */
> > +static noinline int xe_inject_fault(void) { return 0; }
> > +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
> > +#else
> > +static inline int xe_inject_fault(void) { return 0; }
> Why not name this xe_is_injection_active() and just call it directly?
> 
> John.
> 
> > +#endif
> > +
> > +int xe_is_injection_active(void);
> > +
> >   /*
> >    * Occasionally it is seen that the G2H worker starts running after a delay of
> more than
> >    * a second even after being queued and activated by the Linux workqueue
> subsystem. This
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
> b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 822f4c33f730..89f992feba31 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct
> *ct, struct guc_ctb *ctb, u32 reaso
> >
> >   	if (ctb)
> >   		ctb->info.broken = true;
> > +	/*
> > +	 * Huge dump is getting generated when injecting error for guc
> CT/MMIO
> > +	 * functions. So, let us suppress the dump when fault is injected.
> > +	 */
> > +	if (xe_is_injection_active())
> > +		return;
> >
> >   	/* Ignore further errors after the first dump until a reset */
> >   	if (ct->dead.reported)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()
  2025-05-23  5:33   ` K V P, Satyanarayana
@ 2025-05-23 16:51     ` John Harrison
  2025-05-26 11:20       ` Jani Nikula
  0 siblings, 1 reply; 8+ messages in thread
From: John Harrison @ 2025-05-23 16:51 UTC (permalink / raw)
  To: K V P, Satyanarayana, intel-xe@lists.freedesktop.org
  Cc: Wajdeczko, Michal, Nikula, Jani

On 5/22/2025 10:33 PM, K V P, Satyanarayana wrote:
> Hi
>> -----Original Message-----
>> From: Harrison, John C <john.c.harrison@intel.com>
>> Sent: Friday, May 23, 2025 1:09 AM
>> To: K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>; intel-
>> xe@lists.freedesktop.org
>> Cc: Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Nikula, Jani
>> <jani.nikula@intel.com>
>> Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into
>> ct_dead_capture()
>>
>> On 5/22/2025 1:00 AM, Satyanarayana K V P wrote:
>>> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
>>> functions, the CI test systems are going out of space and crashing. To
>>> avoid this issue, a new helper function is created and when fault is
>>> injected into this xe_inject_fault() helper function, ct dead capture
>>> is avoided which suppresses ct dumps in the log.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> Cc: Jani Nikula <jani.nikula@intel.com>
>>>
>>> V4 -> V5:
>>> - Fixed review comments.
>>>
>>> V3 -> V4:
>>> - Updated the name of helper function and moved to xe_device.h file.
>>>
>>> V2 -> V3:
>>> - Added inline function to avoid compilation error in the absence of
>>> CONFIG_FUNCTION_ERROR_INJECTION.
>>>
>>> V1 -> V2:
>>> - Fixed review comments.
>>> ---
>>>    drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
>>>    drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
>>>    drivers/gpu/drm/xe/xe_guc_ct.c |  6 ++++++
>>>    3 files changed, 38 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>> b/drivers/gpu/drm/xe/xe_device.c
>>> index 660b0c5126dc..e1b4087a2974 100644
>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>> @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct
>> drm_device *drm, void *arg)
>>>    	xe_pm_runtime_put(xe);
>>>    }
>>>
>>> +/**
>>> + * xe_is_injection_active() - Helper function to test whether a fault
>>> + * inject test is running.
>>> + *
>>> + * This is a helper function that the driver can use to detect whether
>>> + * a fault injection test is running in order to suppress excessive debug
>>> + * output. By default, the return value is fixed as zero but it can be modified
>>> + * by the fault injection framework to return an error.
>>> + *
>>> + * Returns:
>>> + * 0 if no fault is injected.
>>> + * Injected error if fault is injected.
>>> + */
>>> +int xe_is_injection_active(void)
>>> +{
>>> +	return xe_inject_fault();
>>> +}
>>> +
>> I'm confused. What is the point of this wrapper?
> We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from
> xe_is_injection_active() function. This is to make function names meaningful.
But why?

Why not just have a single function that is named 
"xe_is_injection_active" and has the fault injected into it directly? It 
has a meaningful name, it does what it needs to do, it keeps things 
simple. Why have an extra level of function wrapper for no benefit?

And in terms of meaningful names, "xe_inject_fault" sounds like a 
function which is going to actually inject a fault somewhere. Whereas, 
you are wanting something more like 
"xe_stub_function_for_injecting_fault_into".  But as above, why bother? 
The comment tells you exactly what is happening here. There is no need 
for extra levels of wrapping just to change the name of a function.

John.

>>>    /**
>>>     * xe_device_declare_wedged - Declare device wedged
>>>     * @xe: xe device instance
>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>> b/drivers/gpu/drm/xe/xe_device.h
>>> index 0bc3bc8e6803..13a43fe4fb64 100644
>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>> @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device
>> *xe);
>>>    struct xe_file *xe_file_get(struct xe_file *xef);
>>>    void xe_file_put(struct xe_file *xef);
>>>
>>> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
>>> +/*
>>> + * As fault is injected using this function, need to make sure that
>>> + * the compiler does not optimize and make it as a inline function.
>>> + * To prevent compile optimization, "noinline" is added.
>>> + */
>>> +static noinline int xe_inject_fault(void) { return 0; }
>>> +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
>>> +#else
>>> +static inline int xe_inject_fault(void) { return 0; }
>> Why not name this xe_is_injection_active() and just call it directly?
>>
>> John.
>>
>>> +#endif
>>> +
>>> +int xe_is_injection_active(void);
>>> +
>>>    /*
>>>     * Occasionally it is seen that the G2H worker starts running after a delay of
>> more than
>>>     * a second even after being queued and activated by the Linux workqueue
>> subsystem. This
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
>> b/drivers/gpu/drm/xe/xe_guc_ct.c
>>> index 822f4c33f730..89f992feba31 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>> @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct
>> *ct, struct guc_ctb *ctb, u32 reaso
>>>    	if (ctb)
>>>    		ctb->info.broken = true;
>>> +	/*
>>> +	 * Huge dump is getting generated when injecting error for guc
>> CT/MMIO
>>> +	 * functions. So, let us suppress the dump when fault is injected.
>>> +	 */
>>> +	if (xe_is_injection_active())
>>> +		return;
>>>
>>>    	/* Ignore further errors after the first dump until a reset */
>>>    	if (ct->dead.reported)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture()
  2025-05-23 16:51     ` John Harrison
@ 2025-05-26 11:20       ` Jani Nikula
  0 siblings, 0 replies; 8+ messages in thread
From: Jani Nikula @ 2025-05-26 11:20 UTC (permalink / raw)
  To: John Harrison, K V P, Satyanarayana,
	intel-xe@lists.freedesktop.org
  Cc: Wajdeczko, Michal

On Fri, 23 May 2025, John Harrison <john.c.harrison@intel.com> wrote:
> On 5/22/2025 10:33 PM, K V P, Satyanarayana wrote:
>> Hi
>>> -----Original Message-----
>>> From: Harrison, John C <john.c.harrison@intel.com>
>>> Sent: Friday, May 23, 2025 1:09 AM
>>> To: K V P, Satyanarayana <satyanarayana.k.v.p@intel.com>; intel-
>>> xe@lists.freedesktop.org
>>> Cc: Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Nikula, Jani
>>> <jani.nikula@intel.com>
>>> Subject: Re: [PATCH v5] drm/xe: Add helper function to inject fault into
>>> ct_dead_capture()
>>>
>>> On 5/22/2025 1:00 AM, Satyanarayana K V P wrote:
>>>> When injecting fault to xe_guc_ct_send_recv() & xe_guc_mmio_send_recv()
>>>> functions, the CI test systems are going out of space and crashing. To
>>>> avoid this issue, a new helper function is created and when fault is
>>>> injected into this xe_inject_fault() helper function, ct dead capture
>>>> is avoided which suppresses ct dumps in the log.
>>>>
>>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>>>> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>>> Cc: Jani Nikula <jani.nikula@intel.com>
>>>>
>>>> V4 -> V5:
>>>> - Fixed review comments.
>>>>
>>>> V3 -> V4:
>>>> - Updated the name of helper function and moved to xe_device.h file.
>>>>
>>>> V2 -> V3:
>>>> - Added inline function to avoid compilation error in the absence of
>>>> CONFIG_FUNCTION_ERROR_INJECTION.
>>>>
>>>> V1 -> V2:
>>>> - Fixed review comments.
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_device.c | 18 ++++++++++++++++++
>>>>    drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
>>>>    drivers/gpu/drm/xe/xe_guc_ct.c |  6 ++++++
>>>>    3 files changed, 38 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>>> b/drivers/gpu/drm/xe/xe_device.c
>>>> index 660b0c5126dc..e1b4087a2974 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>> @@ -1120,6 +1120,24 @@ static void xe_device_wedged_fini(struct
>>> drm_device *drm, void *arg)
>>>>    	xe_pm_runtime_put(xe);
>>>>    }
>>>>
>>>> +/**
>>>> + * xe_is_injection_active() - Helper function to test whether a fault
>>>> + * inject test is running.
>>>> + *
>>>> + * This is a helper function that the driver can use to detect whether
>>>> + * a fault injection test is running in order to suppress excessive debug
>>>> + * output. By default, the return value is fixed as zero but it can be modified
>>>> + * by the fault injection framework to return an error.
>>>> + *
>>>> + * Returns:
>>>> + * 0 if no fault is injected.
>>>> + * Injected error if fault is injected.
>>>> + */
>>>> +int xe_is_injection_active(void)
>>>> +{
>>>> +	return xe_inject_fault();
>>>> +}
>>>> +
>>> I'm confused. What is the point of this wrapper?
>> We are injecting fault into xe_inject_fault() function which is generic. The fault injection status is obtained from
>> xe_is_injection_active() function. This is to make function names meaningful.
> But why?
>
> Why not just have a single function that is named 
> "xe_is_injection_active" and has the fault injected into it directly? It 
> has a meaningful name, it does what it needs to do, it keeps things 
> simple. Why have an extra level of function wrapper for no benefit?
>
> And in terms of meaningful names, "xe_inject_fault" sounds like a 
> function which is going to actually inject a fault somewhere. Whereas, 
> you are wanting something more like 
> "xe_stub_function_for_injecting_fault_into".  But as above, why bother? 
> The comment tells you exactly what is happening here. There is no need 
> for extra levels of wrapping just to change the name of a function.

I think all of this begs the question, does this belong in the kernel at
all? The problem is all in userspace ("CI test systems are going out of
space and crashing"). Userspace asks for error injection. Why does the
kernel need to apply the policy of not logging something specific in
this case? What else should the kernel not log, and who's going to
decide?

Personally I think logging core dumps in dmesg even in CI setting is
questionable, but let's just assume it's needed. Then there needs to be
a (debugfs?) knob for userspace to enable/disable that behaviour. The
fault injection test is very specific. Have the test disable dmesg
coredumps during fault injection tests. That's all. Don't make the
kernel more complicated and ridden with special casing for userspace
problems.


BR,
Jani.


>
> John.
>
>>>>    /**
>>>>     * xe_device_declare_wedged - Declare device wedged
>>>>     * @xe: xe device instance
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h
>>> b/drivers/gpu/drm/xe/xe_device.h
>>>> index 0bc3bc8e6803..13a43fe4fb64 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>> @@ -195,6 +195,20 @@ void xe_device_declare_wedged(struct xe_device
>>> *xe);
>>>>    struct xe_file *xe_file_get(struct xe_file *xef);
>>>>    void xe_file_put(struct xe_file *xef);
>>>>
>>>> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
>>>> +/*
>>>> + * As fault is injected using this function, need to make sure that
>>>> + * the compiler does not optimize and make it as a inline function.
>>>> + * To prevent compile optimization, "noinline" is added.
>>>> + */
>>>> +static noinline int xe_inject_fault(void) { return 0; }
>>>> +ALLOW_ERROR_INJECTION(xe_inject_fault, ERRNO);
>>>> +#else
>>>> +static inline int xe_inject_fault(void) { return 0; }
>>> Why not name this xe_is_injection_active() and just call it directly?
>>>
>>> John.
>>>
>>>> +#endif
>>>> +
>>>> +int xe_is_injection_active(void);
>>>> +
>>>>    /*
>>>>     * Occasionally it is seen that the G2H worker starts running after a delay of
>>> more than
>>>>     * a second even after being queued and activated by the Linux workqueue
>>> subsystem. This
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
>>> b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> index 822f4c33f730..89f992feba31 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> @@ -2040,6 +2040,12 @@ static void ct_dead_capture(struct xe_guc_ct
>>> *ct, struct guc_ctb *ctb, u32 reaso
>>>>    	if (ctb)
>>>>    		ctb->info.broken = true;
>>>> +	/*
>>>> +	 * Huge dump is getting generated when injecting error for guc
>>> CT/MMIO
>>>> +	 * functions. So, let us suppress the dump when fault is injected.
>>>> +	 */
>>>> +	if (xe_is_injection_active())
>>>> +		return;
>>>>
>>>>    	/* Ignore further errors after the first dump until a reset */
>>>>    	if (ct->dead.reported)
>

-- 
Jani Nikula, Intel

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-05-26 11:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22  8:00 [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() Satyanarayana K V P
2025-05-22  8:27 ` ✓ CI.Patch_applied: success for drm/xe: Add helper function to inject fault into ct_dead_capture() (rev5) Patchwork
2025-05-22  8:27 ` ✓ CI.checkpatch: " Patchwork
2025-05-22  8:28 ` ✗ CI.KUnit: failure " Patchwork
2025-05-22 19:38 ` [PATCH v5] drm/xe: Add helper function to inject fault into ct_dead_capture() John Harrison
2025-05-23  5:33   ` K V P, Satyanarayana
2025-05-23 16:51     ` John Harrison
2025-05-26 11:20       ` Jani Nikula

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox