Igt-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matt Roper <matthew.d.roper@intel.com>
To: igt-dev@lists.freedesktop.org
Cc: matthew.d.roper@intel.com, Rodrigo Vivi <rodrigo.vivi@intel.com>
Subject: [PATCH i-g-t] tests/intel/xe_wedged: Avoid racy cleanup
Date: Thu,  3 Oct 2024 17:17:16 -0700	[thread overview]
Message-ID: <20241004001716.1323641-1-matthew.d.roper@intel.com> (raw)

The wedged-at-any-timeout subtest uses wedged_mode=2 and triggers an
engine hang via IGT spinner, which means the hardware can become wedged
at any time after the spinner batch is submitted to the kernel.  Any
further ioctl submissions from the test are effectively racing with the
GPU and might succeed if they run before the GPU/driver notices the
hang, or might fail with -ECANCELED if the hang has already been
detected.  This is also true for the cleanup at the end of the
simple_hang() function (i.e., execqueue destruction, gem close, etc.).
The IGT wrappers for those cleanup ioctls will terminate the test if
they fail, causing a bogus test failure.

As soon as the spinner batch is submitted to the kernel, we can't
guarantee that we have time to sneak in any more ioctls, so just drop
the cleanup actions; they aren't really needed.  In a similar vein, we
technically shouldn't assume that a 1 second wait will guarantee that
we've reached wedged status (it almost certainly will, but arbitrary
delays of "this is probably long enough" are never good).  Instead, do a
syncobj_wait on the spinner; that might complete normally or might
return ECANCELED, either of which is a signal that we can move on with
the rest of the test.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2780
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2747
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
---
 tests/intel/xe_wedged.c | 44 +++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/tests/intel/xe_wedged.c b/tests/intel/xe_wedged.c
index a3f7a697f..3678abf13 100644
--- a/tests/intel/xe_wedged.c
+++ b/tests/intel/xe_wedged.c
@@ -170,7 +170,7 @@ simple_exec(int fd, struct drm_xe_engine_class_instance *eci)
 }
 
 static void
-simple_hang(int fd)
+simple_hang(int fd, struct drm_xe_sync *sync)
 {
 	struct drm_xe_engine_class_instance *eci = &xe_engine(fd, 0)->instance;
 	uint32_t vm;
@@ -191,6 +191,11 @@ simple_hang(int fd)
 	struct xe_spin_opts spin_opts = { .preempt = false };
 	int err;
 
+	if (sync) {
+		exec_hang.syncs = to_user_pointer(sync);
+		exec_hang.num_syncs = 1;
+	}
+
 	vm = xe_vm_create(fd, 0, 0);
 	bo_size = xe_bb_size(fd, sizeof(*data));
 	bo = xe_bo_create(fd, vm, bo_size,
@@ -208,11 +213,6 @@ simple_hang(int fd)
 	do {
 		err =  igt_ioctl(fd, DRM_IOCTL_XE_EXEC, &exec_hang);
 	} while (err && errno == ENOMEM);
-
-	xe_exec_queue_destroy(fd, hang_exec_queue);
-	munmap(data, bo_size);
-	gem_close(fd, bo);
-	xe_vm_destroy(fd, vm);
 }
 
 /**
@@ -252,19 +252,37 @@ igt_main
 	}
 
 	igt_subtest_f("wedged-at-any-timeout") {
+		struct drm_xe_sync hang_sync = {
+			.type = DRM_XE_SYNC_TYPE_SYNCOBJ,
+			.flags = DRM_XE_SYNC_FLAG_SIGNAL,
+		};
+		int err;
+
 		igt_require(igt_debugfs_exists(fd, "wedged_mode", O_RDWR));
 		ignore_wedged_in_dmesg();
 
+		hang_sync.handle = syncobj_create(fd, 0);
+
 		igt_debugfs_write(fd, "wedged_mode", "2");
-		simple_hang(fd);
+		simple_hang(fd, &hang_sync);
+
 		/*
-		 * Any ioctl after the first timeout on wedged_mode=2 is blocked
-		 * so we cannot relly on sync objects. Let's wait a bit for
-		 * things to settle before we confirm device as wedged and
-		 * rebind.
+		 * Wait for the hang to be detected.  If the hang has already
+		 * taken place, this will return ECANCELED and we can just move
+		 * on immediately.
 		 */
-		sleep(1);
+		err = syncobj_wait_err(fd, &hang_sync.handle, 1, INT64_MAX, 0);
+		if (err)
+			igt_assert_eq(err, -ECANCELED);
+
+		/* Other ioctls should also be returning ECANCELED now */
 		igt_assert_neq(simple_ioctl(fd), 0);
+		igt_assert_eq(errno, ECANCELED);
+
+		/*
+		 * Rebind the device and ensure proper operation is restored
+		 * for all engines.
+		 */
 		fd = rebind_xe(fd);
 		igt_assert_eq(simple_ioctl(fd), 0);
 		xe_for_each_engine(fd, hwe)
@@ -278,7 +296,7 @@ igt_main
 		igt_assert_eq(simple_ioctl(fd), 0);
 		igt_debugfs_write(fd, "wedged_mode", "1");
 		ignore_wedged_in_dmesg();
-		simple_hang(fd);
+		simple_hang(fd, NULL);
 		igt_assert_eq(simple_ioctl(fd), 0);
 	}
 
-- 
2.46.2


             reply	other threads:[~2024-10-04  0:17 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-04  0:17 Matt Roper [this message]
2024-10-07 14:16 ` ✗ Fi.CI.BUILD: failure for tests/intel/xe_wedged: Avoid racy cleanup Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241004001716.1323641-1-matthew.d.roper@intel.com \
    --to=matthew.d.roper@intel.com \
    --cc=igt-dev@lists.freedesktop.org \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox