public inbox for igt-dev@lists.freedesktop.org
* [PATCH i-g-t] tests/xe_exec: Add VM rebind stress test under workqueue cpumask cycling
@ 2026-04-17 12:01 S Sebinraj
  2026-04-20  8:21 ` Krzysztof Karas
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: S Sebinraj @ 2026-04-17 12:01 UTC (permalink / raw)
  To: igt-dev
  Cc: carlos.santa, matthew.brost, jeevaka.badrappan, karthik.b.s,
	krzysztof.karas, kamil.konieczny, zbigniew.kempczynski,
	S Sebinraj

Add a new subtest threads-wq-stress-rebind-bindexecqueue that stresses
the VM rebind path under workqueue CPU pool migration pressure.

The test spawns per-engine threads that continuously perform VM unbind/
rebind cycles using per-slot bind exec queues, while a helper child
process rapidly cycles the global unbound workqueue cpumask through
progressively wider CPU sets (f -> ff -> fff -> ffff) at 100ms intervals.

A new WQ_STRESS flag enables timed fence waits in test_legacy_mode at
three syncobj_wait() checkpoints (per-exec-queue, bind-chain, and unbind/
TLB-invalidation fences) using a 5-second deadline. If any fence misses
the deadline, a shared atomic is set and all threads bail out immediately
rather than running for the full 30-second window.
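
In sketch form, each WQ_STRESS checkpoint follows the same pattern
(simplified from the diff below):

    if (atomic_load(&wq_stress_hang_detected) ||
        !syncobj_wait(fd, &handle, 1, stress_fence_deadline(), 0, NULL)) {
        atomic_store(&wq_stress_hang_detected, true);
        igt_thread_fail();
        goto wq_stress_cleanup;
    }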

All GPU work runs in a forked child that writes its result via a pipe;
the parent polls with a 60-second timeout and restores the original
cpumask regardless of outcome.
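
The result handoff is a single byte over the pipe; in sketch form
(names as in the diff below):

    /* child */
    uint8_t r = hang ? 1 : 0;
    write(result_pipe[1], &r, 1);

    /* parent */
    struct pollfd pfd = { .fd = result_pipe[0], .events = POLLIN };
    if (poll(&pfd, 1, WQ_CHILD_TIMEOUT_MS) <= 0)
        result_byte = 0xFF; /* no answer: treat as a hang */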

When a hang is detected the child drops the kernel page cache
and issues xe_force_gt_reset_all() to unblock any DMA fences still
pending from the stress run, then writes the hang result and calls
_exit() immediately without closing the DRM fd. Closing the fd while GPU
work is stuck would block indefinitely in dma_resv_wait_timeout(intr=false,
TASK_UNINTERRUPTIBLE). For the same reason, GPU resource teardown
(exec queues, BOs, VMs) is skipped in test_legacy_mode when a hang
is detected.

On the failure path the parent sets SIGCHLD to SIG_IGN so the
potentially D-state child is auto-reaped rather than left as a zombie,
sends SIGKILL as a best-effort wake-up (the signal stays pending but
cannot wake a process sleeping uninterruptibly), and calls igt_fail()
so the subtest is reported cleanly as FAIL.

A bug with this signature regressed in the Xe driver: the whole system
hung because a fence never signalled, which was finally traced to an
issue in kernel workqueue scheduling. A hard reboot was the only way
to bring the system back.
https://patchwork.freedesktop.org/patch/715805/

Cc: Carlos Santa <carlos.santa@intel.com>
Signed-off-by: S Sebinraj <s.sebinraj@intel.com>
---
 tests/intel/xe_exec_threads.c | 341 +++++++++++++++++++++++++++++++++-
 1 file changed, 334 insertions(+), 7 deletions(-)

diff --git a/tests/intel/xe_exec_threads.c b/tests/intel/xe_exec_threads.c
index f082a0eda..1cb6ce7f4 100644
--- a/tests/intel/xe_exec_threads.c
+++ b/tests/intel/xe_exec_threads.c
@@ -13,6 +13,11 @@
  */
 
 #include <fcntl.h>
+#include <inttypes.h>
+#include <poll.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <sys/wait.h>
 
 #include "igt.h"
 #include "lib/igt_syncobj.h"
@@ -23,6 +28,7 @@
 #include "xe/xe_query.h"
 #include "xe/xe_gt.h"
 #include "xe/xe_spin.h"
+#include "lib/igt_thread.h"
 #include <string.h>
 
 #define MAX_N_EXEC_QUEUES	16
@@ -42,9 +48,88 @@
 #define BIND_EXEC_QUEUE	(0x1 << 13)
 #define MANY_QUEUES	(0x1 << 14)
 #define MULTI_QUEUE		(0x1 << 15)
+#define WQ_STRESS		(0x1 << 16)
+
+/*
+ * Maximum fence wait time when WQ_STRESS is active. If any bind/unbind
+ * fence takes longer than this to signal, the workqueue is considered stuck
+ * and the test fails — this is the direct symptom of the
+ * Xe hang caused by the kernel workqueue pool_workqueue pending_pwqs bug.
+ */
+#define WQ_FENCE_TIMEOUT_NS	(5LL * NSEC_PER_SEC)
+
+/*
+ * Maximum time the parent waits for the child to write its result byte.
+ * The child's stress loop runs for up to 30s (igt_until_timeout(30)), plus
+ * cleanup overhead (GT reset, drop_caches, cpumask restore, sleep(1)).
+ * 60s gives ample headroom on a healthy kernel.
+ */
+#define WQ_CHILD_TIMEOUT_MS	(60 * 1000)
+
+/* sysfs node that controls the unbound workqueue CPU affinity mask */
+#define WQ_CPUMASK_PATH		"/sys/devices/virtual/workqueue/cpumask"
+
+/* procfs node for dropping the kernel's page, dentry, and inode caches */
+#define DROP_CACHES_PATH	"/proc/sys/vm/drop_caches"
 
 pthread_barrier_t barrier;
 
+/*
+ * Set to true by the first thread that detects a fence stall under WQ_STRESS.
+ * All other threads and the igt_until_timeout loop check this to bail out
+ * immediately rather than hammering a hung kernel for the full timeout.
+ */
+static _Atomic bool wq_stress_hang_detected;
+
+/**
+ * stress_fence_deadline - compute an absolute CLOCK_MONOTONIC deadline
+ * WQ_FENCE_TIMEOUT_NS (5s) from now, used as the syncobj_wait() timeout
+ * under WQ_STRESS.
+ */
+static int64_t stress_fence_deadline(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (int64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec +
+		WQ_FENCE_TIMEOUT_NS;
+}
+
+/**
+ * cpumask_stressor_loop - run in an igt_fork_helper child process.
+ *
+ * Rapidly cycles the kernel unbound workqueue cpumask through progressively
+ * wider CPU sets (mirroring the original shell reproduction script). This
+ * forces workqueue work items to be migrated between CPU pools, exercising
+ * the wq_node_nr_active / pool_workqueue plug-unplug path that hides the
+ * pending_pwqs scheduling bug.
+ *
+ * The original reproduction commands were:
+ *   for i in {1..1000}; do
+ *       echo f  > /sys/devices/virtual/workqueue/cpumask
+ *       echo ff > /sys/devices/virtual/workqueue/cpumask
+ *       ...
+ *       sleep .1
+ *   done
+ */
+static void cpumask_stressor_loop(void)
+{
+	static const char * const masks[] = { "f", "ff", "fff", "ffff" };
+	int wq_fd;
+
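+	/*
+	 * The parent igt_require()s write access before starting this
+	 * helper; if the open still fails, bail out quietly and let the
+	 * test run without the stressor.
+	 */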
+	wq_fd = open(WQ_CPUMASK_PATH, O_WRONLY);
+	if (wq_fd < 0)
+		exit(0);
+
+	for (;;) {
+		for (int i = 0; i < ARRAY_SIZE(masks); i++) {
+			/* ignore write errors — cpumask rejects invalid masks */
+			write(wq_fd, masks[i], strlen(masks[i]));
+			usleep(100000); /* 100ms */
+		}
+	}
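+	/* not reached: the parent stops this helper with SIGKILL */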
+	close(wq_fd);
+}
+
 static void
 test_balancer(int fd, int gt, uint32_t vm, uint64_t addr, uint64_t userptr,
 	      int class, int n_exec_queues, int n_execs, unsigned int flags)
@@ -600,6 +685,10 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
 		uint64_t exec_addr;
 		int e = i % n_exec_queues;
 
+		/* Bail early if another thread already detected a hang */
+		if ((flags & WQ_STRESS) && atomic_load(&wq_stress_hang_detected))
+			goto wq_stress_cleanup;
+
 		if (flags & MANY_QUEUES) {
 			if (exec_queues[e]) {
 				igt_assert(syncobj_wait(fd, &syncobjs[e], 1,
@@ -693,15 +782,66 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
 		}
 	}
 
-	for (i = 0; i < n_exec_queues; i++)
-		igt_assert(syncobj_wait(fd, &syncobjs[i], 1, INT64_MAX, 0,
-					NULL));
-	igt_assert(syncobj_wait(fd, &sync[0].handle, 1, INT64_MAX, 0, NULL));
+	/*
+	 * Drain all exec-queue fences.  Under WQ_STRESS use a 5s deadline;
+	 * a timeout means the workqueue is hung so bail immediately.
+	 */
+	for (i = 0; i < n_exec_queues; i++) {
+		if (flags & WQ_STRESS) {
+			if (atomic_load(&wq_stress_hang_detected) ||
+			    !syncobj_wait(fd, &syncobjs[i], 1,
+					  stress_fence_deadline(), 0, NULL)) {
+				igt_critical("exec-queue[%d] fence stalled "
+					 "under WQ_STRESS, workqueue "
+					 "scheduling hang suspected\n", i);
+				atomic_store(&wq_stress_hang_detected, true);
+				igt_thread_fail();
+				goto wq_stress_cleanup;
+			}
+		} else {
+			igt_assert(syncobj_wait(fd, &syncobjs[i], 1,
+						INT64_MAX, 0, NULL));
+		}
+	}
+
+	if (flags & WQ_STRESS) {
+		if (atomic_load(&wq_stress_hang_detected) ||
+		    !syncobj_wait(fd, &sync[0].handle, 1,
+				  stress_fence_deadline(), 0, NULL)) {
+			igt_critical("bind-chain fence stalled under WQ_STRESS\n");
+			atomic_store(&wq_stress_hang_detected, true);
+			igt_thread_fail();
+			goto wq_stress_cleanup;
+		}
+	} else {
+		igt_assert(syncobj_wait(fd, &sync[0].handle, 1,
+					INT64_MAX, 0, NULL));
+	}
 
 	sync[0].flags |= DRM_XE_SYNC_FLAG_SIGNAL;
 	xe_vm_unbind_async(fd, vm, bind_exec_queues[0], 0, addr,
 			   bo_size, sync, 1);
-	igt_assert(syncobj_wait(fd, &sync[0].handle, 1, INT64_MAX, 0, NULL));
+	/*
+	 * This is the most critical fence under WQ_STRESS: it covers the
+	 * TLB-invalidation completion triggered by xe_vm_unbind_async().
+	 * If ttm_bo_delayed_delete() workers are stuck in the workqueue
+	 * the TLB flush fence will never signal and we will timeout here.
+	 */
+	if (flags & WQ_STRESS) {
+		if (atomic_load(&wq_stress_hang_detected) ||
+		    !syncobj_wait(fd, &sync[0].handle, 1,
+				  stress_fence_deadline(), 0, NULL)) {
+			igt_critical("unbind/TLB-invalidation fence stalled "
+				 "under WQ_STRESS, "
+				 "ttm_bo_delayed_delete work item likely stuck\n");
+			atomic_store(&wq_stress_hang_detected, true);
+			igt_thread_fail();
+			goto wq_stress_cleanup;
+		}
+	} else {
+		igt_assert(syncobj_wait(fd, &sync[0].handle, 1,
+					INT64_MAX, 0, NULL));
+	}
 
 	for (i = flags & INVALIDATE ? n_execs - 1 : 0;
 	     i < n_execs; i++) {
@@ -713,9 +853,20 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
 			igt_assert_eq(data[i].data, 0xc0ffee);
 	}
 
+wq_stress_cleanup:
 	syncobj_destroy(fd, sync[0].handle);
 	for (i = 0; i < n_exec_queues; i++) {
 		syncobj_destroy(fd, syncobjs[i]);
+		/*
+		 * Under WQ_STRESS, if a hang was detected skip all GPU resource
+		 * teardown calls (xe_exec_queue_destroy, xe_vm_destroy, gem_close).
+		 * Those ioctls wait for pending GPU work to drain and will hang
+		 * indefinitely if the workqueue is stuck. The kernel reclaims all
+		 * GPU resources automatically on process exit.
+		 */
+		if (flags & WQ_STRESS && atomic_load(&wq_stress_hang_detected))
+			continue;
+
 		xe_exec_queue_destroy(fd, exec_queues[i]);
 		if (bind_exec_queues[i])
 			xe_exec_queue_destroy(fd, bind_exec_queues[i]);
@@ -723,11 +874,14 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
 
 	if (bo) {
 		munmap(data, bo_size);
-		gem_close(fd, bo);
+		if (!(flags & WQ_STRESS) || !atomic_load(&wq_stress_hang_detected))
+			gem_close(fd, bo);
 	} else if (!(flags & INVALIDATE)) {
 		free(data);
 	}
-	if (owns_vm)
+
+	if (owns_vm &&
+	    (!(flags & WQ_STRESS) || !atomic_load(&wq_stress_hang_detected)))
 		xe_vm_destroy(fd, vm);
 	if (owns_fd)
 		drm_close_driver(fd);
@@ -1529,6 +1683,179 @@ int igt_main()
 		}
 	}
 
+	/**
+	 * SUBTEST: threads-wq-stress-rebind-bindexecqueue
+	 * Description: Concurrently hammers VM bind/unbind cycles using per-slot
+	 *              bind exec queues across all engines while a background
+	 *              process rapidly cycles the unbound workqueue cpumask,
+	 *              forcing work items to migrate between CPU pools.
+	 *
+	 *              Each bind and unbind fence is waited on with a 5-second
+	 *              deadline. The test passes if all fences signal within that
+	 *              window across repeated iterations for up to 30 seconds.
+	 *              It fails if any fence stalls beyond the deadline, indicating
+	 *              that GPU work items are no longer being scheduled.
+	 */
+	igt_subtest("threads-wq-stress-rebind-bindexecqueue") {
+		const char * const wq_cpumask_path = WQ_CPUMASK_PATH;
+		char orig_cpumask[64] = {};
+		int cfd, result_pipe[2];
+		pid_t child;
+		uint8_t result_byte;
+
+		struct igt_helper_process cpumask_proc = {};
+		int child_fd;
+		bool hang;
+		uint8_t r;
+
+		/* Needs write access to workqueue cpumask sysfs node (root) */
+		igt_require(access(wq_cpumask_path, W_OK) == 0);
+
+		/* Save current cpumask so we can restore it after the test */
+		cfd = open(wq_cpumask_path, O_RDONLY);
+		igt_assert_neq(cfd, -1);
+		igt_assert(read(cfd, orig_cpumask, sizeof(orig_cpumask) - 1) > 0);
+		close(cfd);
+		orig_cpumask[strcspn(orig_cpumask, "\n")] = '\0';
+
+		/* IPC channel: child writes a 1-byte result; parent reads via poll() */
+		igt_assert_eq(pipe(result_pipe), 0);
+
+		/*
+		 * All GPU work is submitted through child_fd (opened inside the
+		 * child). The fixture fd held by the parent has no pending GPU
+		 * work, so drm_close_driver(fd) in the end fixture will never
+		 * block — even if the child's _exit() gets stuck in
+		 * dma_resv_wait_timeout() (intr=false, TASK_UNINTERRUPTIBLE).
+		 */
+		child = fork();
+		igt_assert_neq(child, -1);
+
+		if (child == 0) {
+			/* ---- child: owns all GPU resources ---- */
+			close(result_pipe[0]);
+
+			child_fd = drm_open_driver(DRIVER_XE);
+
+			cpumask_proc.use_SIGKILL = true;
+			igt_fork_helper(&cpumask_proc)
+				cpumask_stressor_loop();
+
+			atomic_store(&wq_stress_hang_detected, false);
+			igt_until_timeout(30) {
+				threads(child_fd,
+					REBIND | BIND_EXEC_QUEUE | WQ_STRESS);
+				if (atomic_load(&wq_stress_hang_detected))
+					break;
+			}
+
+			igt_stop_helper(&cpumask_proc);
+
+			/* Restore cpumask from child */
+			cfd = open(wq_cpumask_path, O_WRONLY);
+			if (cfd >= 0) {
+				write(cfd, orig_cpumask, strlen(orig_cpumask));
+				close(cfd);
+			}
+
+			hang = atomic_load(&wq_stress_hang_detected);
+			if (hang) {
+				int dc_fd;
+
+				igt_critical("WorkQueue hang detected; dropping "
+					     "VM page cache and forcing GT "
+					     "reset\n");
+
+				dc_fd = open(DROP_CACHES_PATH,
+					     O_WRONLY);
+				if (dc_fd >= 0) {
+					write(dc_fd, "3", 1);
+					close(dc_fd);
+				}
+
+				xe_force_gt_reset_all(child_fd);
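+				/* give the reset a moment to signal leftover fences */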
+				sleep(1);
+			}
+
+			/*
+			 * Write result BEFORE _exit().  When hang == true the
+			 * subsequent _exit() triggers do_exit() -> exit_files()
+			 * -> fput(child_fd) -> dma_resv_wait_timeout(intr=false)
+			 * and blocks in TASK_UNINTERRUPTIBLE.  The parent already
+			 * has the answer at this point and does not need to wait
+			 * for the child to actually exit.
+			 */
+			r = hang ? 1 : 0;
+			write(result_pipe[1], &r, 1);
+			close(result_pipe[1]);
+
+			if (!hang)
+				drm_close_driver(child_fd);
+			_exit(hang ? IGT_EXIT_FAILURE : IGT_EXIT_SUCCESS);
+		}
+
+		/* ---- parent ---- */
+		close(result_pipe[1]);
+
+		/*
+		 * Wait up to 60s for the child to write its result byte.
+		 * The child's stress loop runs for up to 30s, plus cleanup
+		 * (GT reset, drop_caches, cpumask restore) — 60s total gives
+		 * enough headroom on a healthy kernel while still catching a
+		 * child that has silently deadlocked before ever writing.
+		 *
+		 * poll() timeout -> treat as hang (result_byte = 0xFF).
+		 * read() returning 0 (EOF, no byte written) -> same.
+		 */
+		{
+			struct pollfd pfd = {
+				.fd     = result_pipe[0],
+				.events = POLLIN,
+			};
+			int poll_ret = poll(&pfd, 1, WQ_CHILD_TIMEOUT_MS);
+
+			if (poll_ret <= 0) {
+				/* timeout (0) or poll error (-1) */
+				igt_warn("Timed out waiting for child result "
+					 "after 60s — treating as hang\n");
+				result_byte = 0xFF;
+			} else if (read(result_pipe[0], &result_byte, 1) != 1) {
+				result_byte = 0xFF; /* EOF: child crashed before writing */
+			}
+		}
+		close(result_pipe[0]);
+
+		/* Belt-and-suspenders: restore cpumask from parent too */
+		cfd = open(wq_cpumask_path, O_WRONLY);
+		if (cfd >= 0) {
+			write(cfd, orig_cpumask, strlen(orig_cpumask));
+			close(cfd);
+		}
+
+		if (result_byte != 0) {
+			igt_critical("WQ stress worker detected fence stall "
+				     "— workqueue scheduling hang "
+				     "confirmed\n");
+
+			/*
+			 * Attempt to kill the child.  If it is stuck in
+			 * TASK_UNINTERRUPTIBLE D-state (dma_resv_wait_timeout
+			 * with intr=false), SIGKILL stays pending: the signal
+			 * is queued but cannot wake an uninterruptible sleep.
+			 * We try anyway so that once the GT reset resolves
+			 * the D-state the child is reaped promptly.
+			 */
+			signal(SIGCHLD, SIG_IGN);  /* auto-reap; never leave a zombie */
+			kill(child, SIGKILL);
+
+			igt_fail(IGT_EXIT_FAILURE);
+		}
+
+		/* Clean path: no hang, reap child normally and continue */
+		waitpid(child, NULL, 0);
+	}
+
 	igt_fixture()
 		drm_close_driver(fd);
 }
-- 
2.43.0




Thread overview: 7+ messages
2026-04-17 12:01 [PATCH i-g-t] tests/xe_exec: Add VM rebind stress test under workqueue cpumask cycling S Sebinraj
2026-04-20  8:21 ` Krzysztof Karas
2026-04-22  5:16   ` Sebinraj, S
2026-04-21 11:55 ` ✓ Xe.CI.BAT: success for " Patchwork
2026-04-21 11:59 ` ✓ i915.CI.BAT: " Patchwork
2026-04-21 12:45 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-04-21 16:05 ` ✗ i915.CI.Full: " Patchwork
