Igt-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH i-g-t] tests/intel/xe_wedged: Avoid racy cleanup
@ 2024-10-04  0:17 Matt Roper
  2024-10-07 14:16 ` ✗ Fi.CI.BUILD: failure for " Patchwork
  0 siblings, 1 reply; 2+ messages in thread
From: Matt Roper @ 2024-10-04  0:17 UTC (permalink / raw)
  To: igt-dev; +Cc: matthew.d.roper, Rodrigo Vivi

The wedged-at-any-timeout subtest uses wedged_mode=2 and triggers an
engine hang via IGT spinner, which means the hardware can become wedged
at any time after the spinner batch is submitted to the kernel.  Any
further ioctl submissions from the test are effectively racing with the
GPU and might succeed if they run before the GPU/driver notices the
hang, or might fail with -ECANCELED if the hang has already been
detected.  This is also true for the cleanup at the end of the
simple_hang() function (i.e., execqueue destruction, gem close, etc.).
The IGT wrappers for those cleanup ioctls will terminate the test if
they fail, causing a bogus test failure.

As soon as the spinner batch is submitted to the kernel, we can't
guarantee that we have time to sneak in any more ioctls, so just drop
the cleanup actions; they aren't really needed.  In a similar vein, we
technically shouldn't assume that a 1 second wait will guarantee that
we've reached wedged status (it almost certainly will, but arbitrary
delays of "this is probably long enough" are never good).  Instead, do a
syncobj_wait on the spinner; that might complete normally or might
return ECANCELED, either of which is a signal that we can move on with
the rest of the test.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2780
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2747
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
---
 tests/intel/xe_wedged.c | 44 +++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/tests/intel/xe_wedged.c b/tests/intel/xe_wedged.c
index a3f7a697f..3678abf13 100644
--- a/tests/intel/xe_wedged.c
+++ b/tests/intel/xe_wedged.c
@@ -170,7 +170,7 @@ simple_exec(int fd, struct drm_xe_engine_class_instance *eci)
 }
 
 static void
-simple_hang(int fd)
+simple_hang(int fd, struct drm_xe_sync *sync)
 {
 	struct drm_xe_engine_class_instance *eci = &xe_engine(fd, 0)->instance;
 	uint32_t vm;
@@ -191,6 +191,11 @@ simple_hang(int fd)
 	struct xe_spin_opts spin_opts = { .preempt = false };
 	int err;
 
+	if (sync) {
+		exec_hang.syncs = to_user_pointer(sync);
+		exec_hang.num_syncs = 1;
+	}
+
 	vm = xe_vm_create(fd, 0, 0);
 	bo_size = xe_bb_size(fd, sizeof(*data));
 	bo = xe_bo_create(fd, vm, bo_size,
@@ -208,11 +213,6 @@ simple_hang(int fd)
 	do {
 		err =  igt_ioctl(fd, DRM_IOCTL_XE_EXEC, &exec_hang);
 	} while (err && errno == ENOMEM);
-
-	xe_exec_queue_destroy(fd, hang_exec_queue);
-	munmap(data, bo_size);
-	gem_close(fd, bo);
-	xe_vm_destroy(fd, vm);
 }
 
 /**
@@ -252,19 +252,37 @@ igt_main
 	}
 
 	igt_subtest_f("wedged-at-any-timeout") {
+		struct drm_xe_sync hang_sync = {
+			.type = DRM_XE_SYNC_TYPE_SYNCOBJ,
+			.flags = DRM_XE_SYNC_FLAG_SIGNAL,
+		};
+		int err;
+
 		igt_require(igt_debugfs_exists(fd, "wedged_mode", O_RDWR));
 		ignore_wedged_in_dmesg();
 
+		hang_sync.handle = syncobj_create(fd, 0);
+
 		igt_debugfs_write(fd, "wedged_mode", "2");
-		simple_hang(fd);
+		simple_hang(fd, &hang_sync);
+
 		/*
-		 * Any ioctl after the first timeout on wedged_mode=2 is blocked
-		 * so we cannot relly on sync objects. Let's wait a bit for
-		 * things to settle before we confirm device as wedged and
-		 * rebind.
+		 * Wait for the hang to be detected.  If the hang has already
+		 * taken place, this will return ECANCELED and we can just move
+		 * on immediately.
 		 */
-		sleep(1);
+		err = syncobj_wait_err(fd, &hang_sync.handle, 1, INT64_MAX, 0);
+		if (err)
+			igt_assert_eq(err, -ECANCELED);
+
+		/* Other ioctls should also be returning ECANCELED now */
 		igt_assert_neq(simple_ioctl(fd), 0);
+		igt_assert_eq(errno, ECANCELED);
+
+		/*
+		 * Rebind the device and ensure proper operation is restored
+		 * for all engines.
+		 */
 		fd = rebind_xe(fd);
 		igt_assert_eq(simple_ioctl(fd), 0);
 		xe_for_each_engine(fd, hwe)
@@ -278,7 +296,7 @@ igt_main
 		igt_assert_eq(simple_ioctl(fd), 0);
 		igt_debugfs_write(fd, "wedged_mode", "1");
 		ignore_wedged_in_dmesg();
-		simple_hang(fd);
+		simple_hang(fd, NULL);
 		igt_assert_eq(simple_ioctl(fd), 0);
 	}
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* ✗ Fi.CI.BUILD: failure for tests/intel/xe_wedged: Avoid racy cleanup
  2024-10-04  0:17 [PATCH i-g-t] tests/intel/xe_wedged: Avoid racy cleanup Matt Roper
@ 2024-10-07 14:16 ` Patchwork
  0 siblings, 0 replies; 2+ messages in thread
From: Patchwork @ 2024-10-07 14:16 UTC (permalink / raw)
  To: Matt Roper; +Cc: igt-dev

== Series Details ==

Series: tests/intel/xe_wedged: Avoid racy cleanup
URL   : https://patchwork.freedesktop.org/series/139537/
State : failure

== Summary ==

Applying: tests/intel/xe_wedged: Avoid racy cleanup
Using index info to reconstruct a base tree...
M	tests/intel/xe_wedged.c
Falling back to patching base and 3-way merge...
Auto-merging tests/intel/xe_wedged.c
CONFLICT (content): Merge conflict in tests/intel/xe_wedged.c
Patch failed at 0001 tests/intel/xe_wedged: Avoid racy cleanup
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2024-10-07 14:16 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-04  0:17 [PATCH i-g-t] tests/intel/xe_wedged: Avoid racy cleanup Matt Roper
2024-10-07 14:16 ` ✗ Fi.CI.BUILD: failure for " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox