Re: [RFC v1] drm/xe: Use fault injection infrastructure to find issues at probe time

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Francois Dugast <francois.dugast@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Lucas De Marchi <lucas.demarchi@intel.com>,
	Matthew Brost <matthew.brost@intel.com>
Subject: Re: [RFC v1] drm/xe: Use fault injection infrastructure to find issues at probe time
Date: Mon, 16 Sep 2024 10:45:37 +0200	[thread overview]
Message-ID: <ZufwMSTB4IlNcOKX@fdugast-desk> (raw)
In-Reply-To: <ZuSdddkrFZgCDkqt@intel.com>

On Fri, Sep 13, 2024 at 04:15:49PM -0400, Rodrigo Vivi wrote:
> On Fri, Sep 13, 2024 at 04:33:05PM +0200, Francois Dugast wrote:
> > This RFC is a rework of https://patchwork.freedesktop.org/series/137314/
> > but instead of importing custom macros from i915, it makes use of the
> > standard kernel fault injection infrastructure, see fault-injection.rst.
> > 
> > In particular, it leverages error injectable functions with the macro
> > ALLOW_ERROR_INJECTION(). This macro requires minimal code changes to the
> > faulting function, if it meets the injectable functions requirements:
> > fault-injection.html#requirements-for-the-error-injectable-functions
> > 
> > Unfortunately this is not the case in most of the injection points of the
> > original series, so a wrapper is added if needed. Only a few examples are
> > shown in this RFC to discuss the proper pattern to apply.
> > 
> > The return code of the functions using ALLOW_ERROR_INJECTION() can be
> > conditionnally modified at runtime by tuning some debugfs entries. This
> > requires CONFIG_FUNCTION_ERROR_INJECTION (among others).
> > 
> > One way to use fault injection at probe time by making each of those
> > functions fail one at a time is:
> > 
> >     FAILTYPE=fail_function
> >     DEVICE="0000:00:08.0" # depends on the system
> >     ERRNO=-12 # -ENOMEM, other value might be needed depending on error handling
> > 
> >     echo N > /sys/kernel/debug/$FAILTYPE/task-filter
> >     echo 100 > /sys/kernel/debug/$FAILTYPE/probability
> >     echo 0 > /sys/kernel/debug/$FAILTYPE/interval
> >     echo -1 > /sys/kernel/debug/$FAILTYPE/times
> >     echo 0 > /sys/kernel/debug/$FAILTYPE/space
> >     echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
> > 
> >     modprobe xe
> >     echo $DEVICE > /sys/bus/pci/drivers/xe/unbind
> > 
> >     grep -oP "^.* \[xe\]" /sys/kernel/debug/$FAILTYPE/injectable | cut -d ' ' -f 1 | while read -r FUNCTION ; do
> >         echo "Injecting fault in $FUNCTION"
> >         echo "" > /sys/kernel/debug/$FAILTYPE/inject
> >         echo $FUNCTION > /sys/kernel/debug/$FAILTYPE/inject
> >         printf %#x $ERRNO > /sys/kernel/debug/$FAILTYPE/$FUNCTION/retval
> >         echo $DEVICE > /sys/bus/pci/drivers/xe/bind
> >     done
> > 
> >     rmmod xe
> > 
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c | 37 +++++++++++++++++++++++++++++++---
> >  drivers/gpu/drm/xe/xe_tile.c   | 18 +++++++++++++++++
> >  drivers/gpu/drm/xe/xe_wopcm.c  |  3 +++
> >  3 files changed, 55 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 4d3c794f134c..2db62aa45b39 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -6,6 +6,7 @@
> >  #include "xe_device.h"
> >  
> >  #include <linux/delay.h>
> > +#include <linux/fault-inject.h>
> >  #include <linux/units.h>
> >  
> >  #include <drm/drm_aperture.h>
> > @@ -300,6 +301,37 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
> >  	ttm_device_fini(&xe->ttm);
> >  }
> >  
> > +/* Wrapper for fault injection */
> > +static noinline int device_create_ttm_device_init(struct xe_device *xe)
> > +{
> > +	/*
> > +	 * In case of error injection, the flow is modified because ttm_device_init()
> > +	 * executes (likely without error) but the return code is overridden in any
> > +	 * case to indicate an error in ttm_device_init(), which likely did not
> > +	 * occur.
> > +	 */
> > +	return ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > +			       xe->drm.anon_inode->i_mapping,
> > +			       xe->drm.vma_offset_manager, false, false);
> > +
> > +	/*
> > +	 * The alternative below would also change the flow because in case of error
> > +	 * injection, ttm_device_init() would not be executed at all.
> > +	 *
> > +	 * int err = device_create_ttm_device_init_inject_fault();
> > +	 * if (err)
> > +	 *	return err;
> > +	 * return ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > +	 *		       xe->drm.anon_inode->i_mapping,
> > +	 *		       xe->drm.vma_offset_manager, false, false);
> > +	 *
> > +	 * ...
> > +	 * static noinline int device_create_ttm_device_init_inject_fault() { return 0; }
> > +	 * ALLOW_ERROR_INJECTION(device_create_ttm_device_init_inject_fault, ERRNO);
> > +	 */
> > +}
> > +ALLOW_ERROR_INJECTION(device_create_ttm_device_init, ERRNO);
> > +
> >  struct xe_device *xe_device_create(struct pci_dev *pdev,
> >  				   const struct pci_device_id *ent)
> >  {
> > @@ -316,9 +348,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
> >  	if (IS_ERR(xe))
> >  		return xe;
> >  
> > -	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> > -			      xe->drm.anon_inode->i_mapping,
> > -			      xe->drm.vma_offset_manager, false, false);
> > +	err = device_create_ttm_device_init(xe);
> 
> I believe the function name could be simplified to xe_device_init_ttm...

Yes indeed.

> 
> >  	if (WARN_ON(err))
> >  		goto err;
> >  
> > @@ -550,6 +580,7 @@ static int wait_for_lmem_ready(struct xe_device *xe)
> >  
> >  	return 0;
> >  }
> > +ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO);
> 
> I liked how simple that is....

It is quite straightforward when the function is error injectable, which is
often not the case. Especially this requirement is often not met: "The
function does not execute any code which can change any state before the first
error return."

This is why other examples in this patch require an extra wrapper, see below.

> 
> >  
> >  static void update_device_info(struct xe_device *xe)
> >  {
> > diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> > index dda5268507d8..22c0a42af9c2 100644
> > --- a/drivers/gpu/drm/xe/xe_tile.c
> > +++ b/drivers/gpu/drm/xe/xe_tile.c
> > @@ -3,6 +3,8 @@
> >   * Copyright © 2023 Intel Corporation
> >   */
> >  
> > +#include <linux/fault-inject.h>
> > +
> >  #include <drm/drm_managed.h>
> >  
> >  #include "xe_device.h"
> > @@ -99,6 +101,12 @@ static int xe_tile_alloc(struct xe_tile *tile)
> >  	return 0;
> >  }
> >  
> > +static noinline int tile_init_early_inject_fault(void)
> > +{
> > +	return 0;
> > +}
> > +ALLOW_ERROR_INJECTION(tile_init_early_inject_fault, ERRNO);
> > +
> >  /**
> >   * xe_tile_init_early - Initialize the tile and primary GT
> >   * @tile: Tile to initialize
> > @@ -117,6 +125,16 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> >  	tile->xe = xe;
> >  	tile->id = id;
> >  
> > +	/*
> > +	 * The flow is modified because xe_tile_init_early() fails before the
> > +	 * first possible error, from xe_tile_alloc(). It is does not match the
> > +	 * 2nd requirement of
> > +	 * fault-injection.html#requirements-for-the-error-injectable-functions
> > +	 */
> 
> I'm afraid I didn't understand this.

In this example we end up returning an error from an artificial code location:

    err = tile_init_early_inject_fault();
    if (err)
        return err;

... before even starting to execute the first statement which could return a
real error, xe_tile_alloc(). Not focusing on this particular example but in
general this can have side effects and create false positives if the state
is not changed as expected, either because the first statement that can fail
was not executed at all, or because all was executed successfully but still
reported as error.

Not sure if this is overthinking, after all this issue is also present with
the i915 macro for error injection and maybe the pros outweigh the cons.

Francois

> 
> and the name could perhaps start with fault_inject?!
> 
> or maybe:
> 
> xe_fault_inject_tile_init_early();

Sure, whatever helps easily identify wrappers only added for fault
injection. Maybe we can even remove the xe_ prefix as the function can be
static?

Francois

> 
> > +	err = tile_init_early_inject_fault();
> > +	if (err)
> > +		return err;
> > +
> >  	err = xe_tile_alloc(tile);
> >  	if (err)
> >  		return err;
> > diff --git a/drivers/gpu/drm/xe/xe_wopcm.c b/drivers/gpu/drm/xe/xe_wopcm.c
> > index 93c82825d896..88a201122a22 100644
> > --- a/drivers/gpu/drm/xe/xe_wopcm.c
> > +++ b/drivers/gpu/drm/xe/xe_wopcm.c
> > @@ -5,6 +5,8 @@
> >  
> >  #include "xe_wopcm.h"
> >  
> > +#include <linux/fault-inject.h>
> > +
> >  #include "regs/xe_guc_regs.h"
> >  #include "xe_device.h"
> >  #include "xe_force_wake.h"
> > @@ -268,3 +270,4 @@ int xe_wopcm_init(struct xe_wopcm *wopcm)
> >  
> >  	return ret;
> >  }
> > +ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO);
> > -- 
> > 2.43.0
> >

next prev parent reply	other threads:[~2024-09-16  8:46 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-13 14:33 [RFC v1] drm/xe: Use fault injection infrastructure to find issues at probe time Francois Dugast
2024-09-13 18:15 ` ✓ CI.Patch_applied: success for " Patchwork
2024-09-13 18:15 ` ✗ CI.checkpatch: warning " Patchwork
2024-09-13 18:15 ` ✗ CI.KUnit: failure " Patchwork
2024-09-13 20:15 ` [RFC v1] " Rodrigo Vivi
2024-09-16  8:45   ` Francois Dugast [this message]
2024-09-18 12:55     ` Francois Dugast

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZufwMSTB4IlNcOKX@fdugast-desk \
    --to=francois.dugast@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox