From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Francois Dugast <francois.dugast@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
Lucas De Marchi <lucas.demarchi@intel.com>,
Matthew Brost <matthew.brost@intel.com>
Subject: Re: [RFC v1] drm/xe: Use fault injection infrastructure to find issues at probe time
Date: Fri, 13 Sep 2024 16:15:49 -0400 [thread overview]
Message-ID: <ZuSdddkrFZgCDkqt@intel.com> (raw)
In-Reply-To: <20240913143305.2262927-1-francois.dugast@intel.com>
On Fri, Sep 13, 2024 at 04:33:05PM +0200, Francois Dugast wrote:
> This RFC is a rework of https://patchwork.freedesktop.org/series/137314/
> but instead of importing custom macros from i915, it makes use of the
> standard kernel fault injection infrastructure, see fault-injection.rst.
>
> In particular, it leverages error injectable functions with the macro
> ALLOW_ERROR_INJECTION(). This macro requires minimal code changes to the
> faulting function, if it meets the injectable functions requirements:
> fault-injection.html#requirements-for-the-error-injectable-functions
>
> Unfortunately this is not the case in most of the injection points of the
> original series, so a wrapper is added if needed. Only a few examples are
> shown in this RFC to discuss the proper pattern to apply.
>
> The return code of the functions using ALLOW_ERROR_INJECTION() can be
> conditionnally modified at runtime by tuning some debugfs entries. This
> requires CONFIG_FUNCTION_ERROR_INJECTION (among others).
>
> One way to use fault injection at probe time by making each of those
> functions fail one at a time is:
>
> FAILTYPE=fail_function
> DEVICE="0000:00:08.0" # depends on the system
> ERRNO=-12 # -ENOMEM, other value might be needed depending on error handling
>
> echo N > /sys/kernel/debug/$FAILTYPE/task-filter
> echo 100 > /sys/kernel/debug/$FAILTYPE/probability
> echo 0 > /sys/kernel/debug/$FAILTYPE/interval
> echo -1 > /sys/kernel/debug/$FAILTYPE/times
> echo 0 > /sys/kernel/debug/$FAILTYPE/space
> echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
>
> modprobe xe
> echo $DEVICE > /sys/bus/pci/drivers/xe/unbind
>
> grep -oP "^.* \[xe\]" /sys/kernel/debug/$FAILTYPE/injectable | cut -d ' ' -f 1 | while read -r FUNCTION ; do
> echo "Injecting fault in $FUNCTION"
> echo "" > /sys/kernel/debug/$FAILTYPE/inject
> echo $FUNCTION > /sys/kernel/debug/$FAILTYPE/inject
> printf %#x $ERRNO > /sys/kernel/debug/$FAILTYPE/$FUNCTION/retval
> echo $DEVICE > /sys/bus/pci/drivers/xe/bind
> done
>
> rmmod xe
>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> ---
> drivers/gpu/drm/xe/xe_device.c | 37 +++++++++++++++++++++++++++++++---
> drivers/gpu/drm/xe/xe_tile.c | 18 +++++++++++++++++
> drivers/gpu/drm/xe/xe_wopcm.c | 3 +++
> 3 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 4d3c794f134c..2db62aa45b39 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -6,6 +6,7 @@
> #include "xe_device.h"
>
> #include <linux/delay.h>
> +#include <linux/fault-inject.h>
> #include <linux/units.h>
>
> #include <drm/drm_aperture.h>
> @@ -300,6 +301,37 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
> ttm_device_fini(&xe->ttm);
> }
>
> +/* Wrapper for fault injection */
> +static noinline int device_create_ttm_device_init(struct xe_device *xe)
> +{
> + /*
> + * In case of error injection, the flow is modified because ttm_device_init()
> + * executes (likely without error) but the return code is overridden in any
> + * case to indicate an error in ttm_device_init(), which likely did not
> + * occur.
> + */
> + return ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> + xe->drm.anon_inode->i_mapping,
> + xe->drm.vma_offset_manager, false, false);
> +
> + /*
> + * The alternative below would also change the flow because in case of error
> + * injection, ttm_device_init() would not be executed at all.
> + *
> + * int err = device_create_ttm_device_init_inject_fault();
> + * if (err)
> + * return err;
> + * return ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> + * xe->drm.anon_inode->i_mapping,
> + * xe->drm.vma_offset_manager, false, false);
> + *
> + * ...
> + * static noinline int device_create_ttm_device_init_inject_fault() { return 0; }
> + * ALLOW_ERROR_INJECTION(device_create_ttm_device_init_inject_fault, ERRNO);
> + */
> +}
> +ALLOW_ERROR_INJECTION(device_create_ttm_device_init, ERRNO);
> +
> struct xe_device *xe_device_create(struct pci_dev *pdev,
> const struct pci_device_id *ent)
> {
> @@ -316,9 +348,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
> if (IS_ERR(xe))
> return xe;
>
> - err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> - xe->drm.anon_inode->i_mapping,
> - xe->drm.vma_offset_manager, false, false);
> + err = device_create_ttm_device_init(xe);
I believe the function name could be simplified to xe_device_init_ttm...
> if (WARN_ON(err))
> goto err;
>
> @@ -550,6 +580,7 @@ static int wait_for_lmem_ready(struct xe_device *xe)
>
> return 0;
> }
> +ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO);
I liked how simple that is....
>
> static void update_device_info(struct xe_device *xe)
> {
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index dda5268507d8..22c0a42af9c2 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -3,6 +3,8 @@
> * Copyright © 2023 Intel Corporation
> */
>
> +#include <linux/fault-inject.h>
> +
> #include <drm/drm_managed.h>
>
> #include "xe_device.h"
> @@ -99,6 +101,12 @@ static int xe_tile_alloc(struct xe_tile *tile)
> return 0;
> }
>
> +static noinline int tile_init_early_inject_fault(void)
> +{
> + return 0;
> +}
> +ALLOW_ERROR_INJECTION(tile_init_early_inject_fault, ERRNO);
> +
> /**
> * xe_tile_init_early - Initialize the tile and primary GT
> * @tile: Tile to initialize
> @@ -117,6 +125,16 @@ int xe_tile_init_early(struct xe_tile *tile, struct xe_device *xe, u8 id)
> tile->xe = xe;
> tile->id = id;
>
> + /*
> + * The flow is modified because xe_tile_init_early() fails before the
> + * first possible error, from xe_tile_alloc(). It is does not match the
> + * 2nd requirement of
> + * fault-injection.html#requirements-for-the-error-injectable-functions
> + */
I'm afraid I didn't understand this.
and the name could perhaps start with fault_inject?!
or maybe:
xe_fault_inject_tile_init_early();
> + err = tile_init_early_inject_fault();
> + if (err)
> + return err;
> +
> err = xe_tile_alloc(tile);
> if (err)
> return err;
> diff --git a/drivers/gpu/drm/xe/xe_wopcm.c b/drivers/gpu/drm/xe/xe_wopcm.c
> index 93c82825d896..88a201122a22 100644
> --- a/drivers/gpu/drm/xe/xe_wopcm.c
> +++ b/drivers/gpu/drm/xe/xe_wopcm.c
> @@ -5,6 +5,8 @@
>
> #include "xe_wopcm.h"
>
> +#include <linux/fault-inject.h>
> +
> #include "regs/xe_guc_regs.h"
> #include "xe_device.h"
> #include "xe_force_wake.h"
> @@ -268,3 +270,4 @@ int xe_wopcm_init(struct xe_wopcm *wopcm)
>
> return ret;
> }
> +ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO);
> --
> 2.43.0
>
next prev parent reply other threads:[~2024-09-13 20:16 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-13 14:33 [RFC v1] drm/xe: Use fault injection infrastructure to find issues at probe time Francois Dugast
2024-09-13 18:15 ` ✓ CI.Patch_applied: success for " Patchwork
2024-09-13 18:15 ` ✗ CI.checkpatch: warning " Patchwork
2024-09-13 18:15 ` ✗ CI.KUnit: failure " Patchwork
2024-09-13 20:15 ` Rodrigo Vivi [this message]
2024-09-16 8:45 ` [RFC v1] " Francois Dugast
2024-09-18 12:55 ` Francois Dugast
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZuSdddkrFZgCDkqt@intel.com \
--to=rodrigo.vivi@intel.com \
--cc=francois.dugast@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
--cc=matthew.brost@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.