Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
To: "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Cc: "De Marchi, Lucas" <lucas.demarchi@intel.com>,
	"Vivi, Rodrigo" <rodrigo.vivi@intel.com>
Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt reset
Date: Tue, 1 Aug 2023 11:32:16 +0530	[thread overview]
Message-ID: <ZMif6Huf+E95GL3J@bvivekan-mobl> (raw)
In-Reply-To: <MW4PR11MB7056657DF4820D71CBB0DBE5B305A@MW4PR11MB7056.namprd11.prod.outlook.com>

On 31.07.2023 20:54, Ghimiray, Himal Prasad wrote:
> Hi Bala,
> 
> > -----Original Message-----
> > From: Vivekanandan, Balasubramani
> > <balasubramani.vivekanandan@intel.com>
> > Sent: 31 July 2023 20:00
> > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-
> > xe@lists.freedesktop.org
> > Cc: De Marchi, Lucas <lucas.demarchi@intel.com>; Vivi, Rodrigo
> > <rodrigo.vivi@intel.com>
> > Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt
> > reset
> > 
> > On 27.07.2023 04:56, Himal Prasad Ghimiray wrote:
> > > To trigger gt reset failure:
> > >  echo 100 >  /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability
> > >  echo 2 >  /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times
> > >
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > >
> > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++
> > >  drivers/gpu/drm/xe/xe_gt.c      |  8 +++++++-
> > >  drivers/gpu/drm/xe/xe_gt.h      | 14 ++++++++++++++
> > >  3 files changed, 31 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c
> > > b/drivers/gpu/drm/xe/xe_debugfs.c index 047341d5689a..1fd016e6f7a0
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_debugfs.c
> > > +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> > > @@ -5,6 +5,7 @@
> > >
> > >  #include "xe_debugfs.h"
> > >
> > > +#include <linux/fault-inject.h>
> > >  #include <linux/string_helpers.h>
> > >
> > >  #include <drm/drm_debugfs.h>
> > > @@ -20,6 +21,10 @@
> > >  #include "xe_vm.h"
> > >  #endif
> > >
> > > +#ifdef CONFIG_FAULT_INJECTION
> > > +DECLARE_FAULT_ATTR(gt_reset_failure);
> > > +#endif
> > > +
> > >  static struct xe_device *node_to_xe(struct drm_info_node *node)  {
> > >  	return to_xe_device(node->minor->dev); @@ -135,4 +140,9 @@
> > void
> > > xe_debugfs_register(struct xe_device *xe)
> > >
> > >  	for_each_gt(gt, xe, id)
> > >  		xe_gt_debugfs_register(gt);
> > > +
> > > +#ifdef CONFIG_FAULT_INJECTION
> > > +	fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
> > > +#endif
> > > +
> > >  }
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index 5e70e486b27c..691e3baf97c9 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt)
> > >
> > >  	xe_gt_info(gt, "reset started\n");
> > >
> > > +	if (xe_fault_inject_gt_reset()) {
> > > +		err = -ECANCELED;
> > > +		goto err_fail;
> > > +	}
> > > +
> > >  	xe_gt_sanitize(gt);
> > >
> > >  	xe_device_mem_access_get(gt_to_xe(gt));
> > > @@ -563,6 +568,7 @@ static int gt_reset(struct xe_gt *gt)
> > >  err_msg:
> > >  	XE_WARN_ON(xe_uc_start(&gt->uc));
> > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > > +err_fail:
> > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > >
> > >  	/* Notify userspace about gt reset failure */ @@ -584,7 +590,7 @@
> > > void xe_gt_reset_async(struct xe_gt *gt)
> > >  	xe_gt_info(gt, "trying reset\n");
> > >
> > >  	/* Don't do a reset while one is already in flight */
> > > -	if (xe_uc_reset_prepare(&gt->uc))
> > > +	if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(&gt->uc))
> > 
> > When `fail_gt_reset/probability` is set to a less than 100 value, then
> > xe_fault_inject_gt_reset() will not always return true. So if the
> > xe_fault_inject_gt_reset() returns differenet values when invoked from
> > xe_gt_reset_async() and gt_reset(), we will have unexpected behaviour.
> > 
> > We should avoid calling xe_fault_inject_gt_reset() more than once in a single
> > reset cycle.
> > We could exit immediately in xe_gt_reset_async() if fault injection is enabled.
> Intention here is to cause next reset fail. Unless you make the probability 100 that is not guaranteed. 
> And no.of times more than 1 are perfectly fine as long as probability is 100. 
> Even in commit message it is mentioned clearly to use probability as 100.

Through the commit message it is understood that we need to supply 100 and >1
values for probability and no of times debugfs entries. But it doesn't
prevent anyone from trying other values. In such cases, it is fine for
the fault injection to not work, but it should not cause any issues in
the driver.
I think if xe_fault_inject_gt_reset() returns false in
xe_gt_reset_async() and true in gt_reset(), then we will end up in
xe_uc_reset_prepare() invoked but no gt reset completely executed.
Thought I haven't analyzed how it would impact driver, I think the
driver would be left in some unexpected state. We should avoid this.

Wrong values for the debugfs entries should have no impact on the
driver.

Regards,
Bala

> 
> BR
> Himal 
> > 
> > Regards,
> > Bala
> > >  		return;
> > >
> > >  	xe_gt_info(gt, "reset queued\n");
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > > index 7298653a73de..caded203a8a0 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > > @@ -7,6 +7,7 @@
> > >  #define _XE_GT_H_
> > >
> > >  #include <drm/drm_util.h>
> > > +#include <linux/fault-inject.h>
> > >
> > >  #include "xe_device_types.h"
> > >  #include "xe_hw_engine.h"
> > > @@ -16,6 +17,19 @@
> > >  		for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \
> > >  			  xe_hw_engine_is_valid((hwe__)))
> > >
> > > +#ifdef CONFIG_FAULT_INJECTION
> > > +extern struct fault_attr gt_reset_failure; static inline bool
> > > +xe_fault_inject_gt_reset(void) {
> > > +	return should_fail(&gt_reset_failure, 1); } #else static inline bool
> > > +xe_fault_inject_gt_reset(void) {
> > > +	return false;
> > > +}
> > > +#endif
> > > +
> > >  struct xe_gt *xe_gt_alloc(struct xe_tile *tile);  int
> > > xe_gt_init_early(struct xe_gt *gt);  int xe_gt_init(struct xe_gt *gt);
> > > --
> > > 2.25.1
> > >

  reply	other threads:[~2023-08-01  6:02 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-26 23:26 [Intel-xe] [PATCH v9 0/3] Notify userspace about uevent failure Himal Prasad Ghimiray
2023-07-26 23:26 ` [Intel-xe] [PATCH v9 1/3] fault-inject: Include linux/types.h by default Himal Prasad Ghimiray
2023-07-28 16:49   ` Rodrigo Vivi
2023-07-26 23:26 ` [Intel-xe] [PATCH v9 2/3] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray
2023-07-26 23:26 ` [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt reset Himal Prasad Ghimiray
2023-07-31 14:30   ` Balasubramani Vivekanandan
2023-07-31 15:24     ` Ghimiray, Himal Prasad
2023-08-01  6:02       ` Balasubramani Vivekanandan [this message]
2023-08-01  6:45         ` Ghimiray, Himal Prasad
2023-08-01  8:03           ` Balasubramani Vivekanandan
2023-08-01  8:04             ` Ghimiray, Himal Prasad
2023-07-27  0:29 ` [Intel-xe] ✓ CI.Patch_applied: success for Notify userspace about uevent failure Patchwork
2023-07-27  0:29 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork
2023-07-27  0:30 ` [Intel-xe] ✓ CI.KUnit: success " Patchwork
2023-07-27  0:34 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-07-27  0:35 ` [Intel-xe] ✓ CI.Hooks: " Patchwork
2023-07-27  0:36 ` [Intel-xe] ✗ CI.checksparse: warning " Patchwork
2023-07-27  1:10 ` [Intel-xe] ○ CI.BAT: info " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZMif6Huf+E95GL3J@bvivekan-mobl \
    --to=balasubramani.vivekanandan@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox