All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <igt-dev@lists.freedesktop.org>, <intel-xe@lists.freedesktop.org>,
	"Himal Prasad Ghimiray" <himal.prasad.ghimiray@intel.com>
Subject: Re: [PATCH i-g-t] tests/intel/xe_wedged: Introduce a new test for Xe device wedged state
Date: Thu, 14 Mar 2024 13:57:54 -0400	[thread overview]
Message-ID: <ZfM6ogyMbRxQwMkS@intel.com> (raw)
In-Reply-To: <rkw7vdk6fssletnevhneqciacoq5rxgbwfb6engvpf4yuqvmqb@x5koe4zreiwo>

On Wed, Mar 13, 2024 at 03:07:44PM -0500, Lucas De Marchi wrote:
> On Wed, Mar 13, 2024 at 03:55:28PM -0400, Rodrigo Vivi wrote:
> > Let's inject a gt_reset failure that will put Xe device in the
> > new wedged state, then we confirm the IOCTL is blocked and we
> > reload the driver to get back to a clean state for other test
> > execution, since wedged state in Xe is a final state that can only
> > be cleared with a module reload.
> > 
> > This new test case is entirely based on xe_uevent provided by
> > Himal.
> 
> /me confused... I don't see any uevent handling here.

the uevent part is gone, but the failure injection came from there.

> 
> > 
> > Cc:  Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> > tests/intel/xe_wedged.c | 91 +++++++++++++++++++++++++++++++++++++++++
> > tests/meson.build       |  1 +
> > 2 files changed, 92 insertions(+)
> > create mode 100644 tests/intel/xe_wedged.c
> > 
> > diff --git a/tests/intel/xe_wedged.c b/tests/intel/xe_wedged.c
> > new file mode 100644
> > index 000000000..f767e2511
> > --- /dev/null
> > +++ b/tests/intel/xe_wedged.c
> > @@ -0,0 +1,91 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +/**
> > + * TEST: cause fake gt reset failure which put Xe device in wedged state
> > + * Category: Software building block
> > + * Sub-category: driver
> > + * Functionality: wedged
> > + * Test category: functionality test
> > + */
> > +
> > +#include "igt.h"
> > +#include "igt_kmod.h"
> > +
> > +#include "xe/xe_ioctl.h"
> > +
> > +static void force_wedged(int fd)
> > +{
> > +	igt_debugfs_write(fd, "fail_gt_reset/probability", "100");
> > +	igt_debugfs_write(fd, "fail_gt_reset/times", "2");
> > +
> > +	xe_force_gt_reset(fd, 0);
> 
> humn... do we have to check the writes above did anything?

unfortunately the debugfs_write is a void return...

we could read it back, but I don't believe it brings anything...

>  I also don't
> see the kernel side, but if it just resets normally, the test would
> still pass afaics.

nope, if the reset works normally without injecting the failure and
declaring the gt busted, then we would fail below

igt_assert_eq(simple_ioctl(fd), 0);
force_busted(fd);
igt_assert_neq(simple_ioctl(fd), 0);
fd = rebind_xe(fd);
igt_assert_eq(simple_ioctl(fd), 0);

notice that the middle one is != 0,


but I'm considering to change that to

igt_assert_eq(simple_ioctl(fd), -ECANCELED);

for clarity.

> 
> > +	sleep(1);
> > +}
> > +
> > +static int reload_xe(int fd)
> > +{
> > +	int error;
> > +
> > +	drm_close_driver(fd);
> > +	igt_xe_driver_unload();
> 
> 
> what if we are running on e.g. MTL with a DG2 and want to debug one of
> them? Rather than re-loading the module and possibly causing unrelated
> issues (if e.g. module removal from the other card crashes), why not
> just unbind the module from the card under test?
> 
> i.e. the equivalent in C of:
> 
> rebind() {
> 	pci_slot=$1
> 	echo -n "0000:$pci_slot" > /sys/bus/pci/drivers/$driver/unbind
> 	echo -n "0000:$pci_slot" > /sys/bus/pci/drivers/$driver/bind
> }

Thanks, that indeed is a better choice.

the only caveat is that I need to close the main fd client for a proper
exit before we can rebind cleanly. But I could finally get that working.

> 
> Lucas De Marchi
> 
> > +
> > +	error = igt_xe_driver_load(NULL);
> > +
> > +	igt_assert_eq(error, 0);
> > +
> > +	/* driver is ready, check if it's bound */
> > +	fd = __drm_open_driver(DRIVER_XE);
> > +	igt_fail_on_f(fd < 0, "Cannot open the xe DRM driver while reloading xe after wedged\n");
> > +	return fd;
> > +}
> > +
> > +static int simple_ioctl(int fd)
> > +{
> > +	int ret;
> > +
> > +	struct drm_xe_vm_create create = {
> > +		.extensions = 0,
> > +		.flags = 0,
> > +	};
> > +
> > +	ret = igt_ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create);
> > +
> > +	if (ret == 0)
> > +		xe_vm_destroy(fd, create.vm_id);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * SUBTEST: basic-wedged
> > + * Description: Force Xe device wedged after injecting a failure in GT reset
> > + */
> > +igt_main
> > +{
> > +	int fd;
> > +
> > +	igt_fixture {
> > +		fd = drm_open_driver(DRIVER_XE);
> > +		igt_require(igt_debugfs_exists(fd, "fail_gt_reset/probability",
> > +					       O_RDWR));
> > +	}
> > +
> > +	igt_subtest("basic-wedged") {
> > +		igt_assert_eq(simple_ioctl(fd), 0);
> > +		force_wedged(fd);
> > +		igt_assert_neq(simple_ioctl(fd), 0);
> > +		fd = reload_xe(fd);
> > +		igt_assert_eq(simple_ioctl(fd), 0);
> > +	}
> > +
> > +	igt_fixture {
> > +		if (igt_debugfs_exists(fd, "fail_gt_reset/probability", O_RDWR)) {
> > +			igt_debugfs_write(fd, "fail_gt_reset/probability", "0");
> > +			igt_debugfs_write(fd, "fail_gt_reset/times", "1");
> > +		}
> > +		drm_close_driver(fd);
> > +	}
> > +}
> > diff --git a/tests/meson.build b/tests/meson.build
> > index a856510fc..e590d4348 100644
> > --- a/tests/meson.build
> > +++ b/tests/meson.build
> > @@ -312,6 +312,7 @@ intel_xe_progs = [
> > 	'xe_render_copy',
> > 	'xe_vm',
> > 	'xe_waitfence',
> > +	'xe_wedged',
> > 	'xe_spin_batch',
> > 	'xe_sysfs_defaults',
> > 	'xe_sysfs_scheduler',
> > -- 
> > 2.44.0
> > 

  reply	other threads:[~2024-03-14 17:58 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-13 19:55 [PATCH i-g-t] tests/intel/xe_wedged: Introduce a new test for Xe device wedged state Rodrigo Vivi
2024-03-13 20:07 ` Lucas De Marchi
2024-03-14 17:57   ` Rodrigo Vivi [this message]
2024-03-13 20:45 ` ✗ CI.Patch_applied: failure for " Patchwork
2024-03-13 23:27 ` ✗ CI.xeBAT: " Patchwork
2024-03-13 23:45 ` ✓ Fi.CI.BAT: success " Patchwork
2024-03-14  8:01 ` ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZfM6ogyMbRxQwMkS@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=igt-dev@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.