From: S Sebinraj <s.sebinraj@intel.com>
To: igt-dev@lists.freedesktop.org
Cc: carlos.santa@intel.com, matthew.brost@intel.com,
jeevaka.badrappan@intel.com, karthik.b.s@intel.com,
krzysztof.karas@intel.com, kamil.konieczny@intel.com,
zbigniew.kempczynski@intel.com, S Sebinraj <s.sebinraj@intel.com>
Subject: [PATCH i-g-t v3] tests/intel/xe_exec: Add VM rebind stress test with cpumask cycling
Date: Fri, 24 Apr 2026 09:54:57 +0530
Message-ID: <20260424042457.2178562-1-s.sebinraj@intel.com>
Add a new subtest threads-wq-stress-rebind-bindexecqueue that stresses
the VM rebind path under workqueue CPU pool migration pressure.
The test spawns per-engine threads that continuously perform VM unbind/
rebind cycles using per-slot bind exec queues, while a helper child
process rapidly cycles the global unbound workqueue cpumask through
progressively wider CPU sets (f -> ff -> fff -> ffff) at 100ms intervals.
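The cycling is equivalent to this sketch (error handling omitted; the
actual cpumask_stressor_loop() below also fails the test on an
unexpected write error):

    static const char * const masks[] = { "f", "ff", "fff", "ffff" };
    int fd = open("/sys/devices/virtual/workqueue/cpumask", O_WRONLY);

    for (;;)
        for (int i = 0; i < 4; i++) {
            write(fd, masks[i], strlen(masks[i]));
            usleep(100000); /* 100ms */
        }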
A new WQ_STRESS flag enables timed fence waits in test_legacy_mode at
three syncobj_wait() checkpoints (per-exec-queue, bind-chain, and unbind/
TLB-invalidation fences) using a 5-second deadline. If any fence misses
the deadline, a shared atomic is set and all threads bail out immediately
rather than running for the full 30-second window.
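Each checkpoint reduces to the same pattern (abridged from
test_legacy_mode() in the diff below):

    if (atomic_load(&wq_stress_hang_detected) ||
        !syncobj_wait(fd, &syncobjs[i], 1, stress_fence_deadline(),
                      0, NULL)) {
        atomic_store(&wq_stress_hang_detected, true);
        igt_thread_fail();
        goto wq_stress_cleanup;
    }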
All GPU work runs in a forked child that writes its result via a pipe;
the parent polls with a 60-second timeout and restores the original
cpumask regardless of the outcome.
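On the parent side this is a plain poll() on the read end of the pipe
(abridged from the subtest below); a poll timeout or EOF on the pipe
is treated as a hang:

    struct pollfd pfd = { .fd = result_pipe[0], .events = POLLIN };

    if (poll(&pfd, 1, WQ_CHILD_TIMEOUT_MS) <= 0 ||
        read(result_pipe[0], &result_byte, 1) != 1)
        result_byte = 0xFF; /* no result from the child: assume hang */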
When a hang is detected, the child drops the kernel page cache
and issues xe_force_gt_reset_all() to unblock any DMA fences still
pending from the stress run, then writes the hang result and calls
_exit() immediately without closing the DRM fd. Closing the fd while GPU
work is stuck would block indefinitely in dma_resv_wait_timeout(intr=false,
TASK_UNINTERRUPTIBLE). For the same reason, GPU resource teardown
(exec queues, BOs, VMs) is skipped in test_legacy_mode when a hang
is detected.
On the failure path the test is aborted rather than only marked as
failed: the child process is left hung in the D (uninterruptible
sleep) state and could affect other tests. Aborting forces a reboot of
the testing environment instead.
A bug with this same signature previously regressed the Xe driver: the
whole system hung because a fence signal never completed, which was
finally traced to an issue in kernel workqueue scheduling. A hard
reboot was the only way to bring the system back:
https://patchwork.freedesktop.org/patch/715805/
v2:
- Abort the hung test instead of marking as fail alone
- Code comment corrections
- Fix type casting
- Write check for cpumask
v3:
- Rename title of commit
- Correct commenting style
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Signed-off-by: S Sebinraj <s.sebinraj@intel.com>
---
tests/intel/xe_exec_threads.c | 334 +++++++++++++++++++++++++++++++++-
1 file changed, 327 insertions(+), 7 deletions(-)
diff --git a/tests/intel/xe_exec_threads.c b/tests/intel/xe_exec_threads.c
index f082a0eda..7197f4345 100644
--- a/tests/intel/xe_exec_threads.c
+++ b/tests/intel/xe_exec_threads.c
@@ -13,9 +13,15 @@
*/
#include <fcntl.h>
+#include <inttypes.h>
+#include <poll.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <sys/wait.h>
#include "igt.h"
#include "lib/igt_syncobj.h"
+#include "lib/igt_thread.h"
#include "lib/intel_reg.h"
#include "xe_drm.h"
@@ -42,9 +48,89 @@
#define BIND_EXEC_QUEUE (0x1 << 13)
#define MANY_QUEUES (0x1 << 14)
#define MULTI_QUEUE (0x1 << 15)
+#define WQ_STRESS (0x1 << 16)
+
+/*
+ * Maximum fence wait time when WQ_STRESS is active. If any bind/unbind
+ * fence takes longer than this to signal, the workqueue is considered stuck
+ * and the test fails — this is the direct symptom of the
+ * Xe hang caused by the kernel workqueue pool_workqueue pending_pwqs bug.
+ */
+#define WQ_FENCE_TIMEOUT_NS (5LL * NSEC_PER_SEC)
+
+/*
+ * Maximum time the parent waits for the child to write its result byte.
+ * The child's stress loop runs for up to 30s (igt_until_timeout(30)), plus
+ * cleanup overhead (GT reset, drop_caches, cpumask restore, sleep(1)).
+ * 60s gives ample headroom on a healthy kernel.
+ */
+#define WQ_CHILD_TIMEOUT_MS (60 * 1000)
+
+/* Sysfs node that controls the unbound workqueue CPU affinity mask. */
+#define WQ_CPUMASK_PATH "/sys/devices/virtual/workqueue/cpumask"
+
+/* Procfs node for dropping the kernel's page, dentry, and inode caches. */
+#define DROP_CACHES_PATH "/proc/sys/vm/drop_caches"
pthread_barrier_t barrier;
+/*
+ * Set to true by the first thread that detects a fence stall under WQ_STRESS.
+ * All other threads and the igt_until_timeout loop check this to bail out
+ * immediately rather than hammering a hung kernel for the full timeout.
+ */
+static _Atomic bool wq_stress_hang_detected;
+
+/*
+ * stress_fence_deadline - compute an absolute CLOCK_MONOTONIC deadline
+ * 5 seconds from now, used as the syncobj_wait timeout under WQ_STRESS.
+ */
+static int64_t stress_fence_deadline(void)
+{
+ struct timespec ts;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts);
+ return (int64_t)ts.tv_sec * NSEC_PER_SEC + (int64_t)ts.tv_nsec +
+ WQ_FENCE_TIMEOUT_NS;
+}
+
+/*
+ * cpumask_stressor_loop
+ *
+ * Rapidly cycles the kernel unbound workqueue cpumask through progressively
+ * wider CPU sets (mirroring the original shell reproduction script). This
+ * forces workqueue work items to be migrated between CPU pools, exercising
+ * the wq_node_nr_active / pool_workqueue plug-unplug path in which the
+ * pending_pwqs scheduling bug hides.
+ *
+ * The original reproduction commands were:
+ * for i in {1..1000}; do
+ * echo f > /sys/devices/virtual/workqueue/cpumask
+ * echo ff > /sys/devices/virtual/workqueue/cpumask
+ * ...
+ * sleep .1
+ * done
+ */
+static void cpumask_stressor_loop(void)
+{
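+ /* Hex CPU masks of increasing width: "f" covers CPUs 0-3, "ffff" CPUs 0-15 */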
+ static const char * const masks[] = { "f", "ff", "fff", "ffff" };
+ int wq_fd;
+
+ wq_fd = open(WQ_CPUMASK_PATH, O_WRONLY);
+ if (wq_fd < 0)
+ exit(IGT_EXIT_FAILURE);
+
+ for (;;) {
+ for (int i = 0; i < ARRAY_SIZE(masks); i++) {
+ if (write(wq_fd, masks[i], strlen(masks[i])) < 0 &&
+ errno != EINVAL)
+ exit(IGT_EXIT_FAILURE); /* unexpected error - fail the test */
+ usleep(100000); /* 100ms */
+ }
+ }
+ /* Not reached: the helper runs until igt_stop_helper() kills it */
+}
+
static void
test_balancer(int fd, int gt, uint32_t vm, uint64_t addr, uint64_t userptr,
int class, int n_exec_queues, int n_execs, unsigned int flags)
@@ -600,6 +686,10 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
uint64_t exec_addr;
int e = i % n_exec_queues;
+ /* Bail early if another thread already detected a hang */
+ if ((flags & WQ_STRESS) && atomic_load(&wq_stress_hang_detected))
+ goto wq_stress_cleanup;
+
if (flags & MANY_QUEUES) {
if (exec_queues[e]) {
igt_assert(syncobj_wait(fd, &syncobjs[e], 1,
@@ -693,15 +783,65 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
}
}
- for (i = 0; i < n_exec_queues; i++)
- igt_assert(syncobj_wait(fd, &syncobjs[i], 1, INT64_MAX, 0,
- NULL));
- igt_assert(syncobj_wait(fd, &sync[0].handle, 1, INT64_MAX, 0, NULL));
+
+ for (i = 0; i < n_exec_queues; i++) {
+ if (flags & WQ_STRESS) {
+ /*
+ * Drain all exec-queue fences under a 5 sec deadline.
+ * A timeout means the workqueue is hung, so bail immediately.
+ */
+ if (atomic_load(&wq_stress_hang_detected) ||
+ !syncobj_wait(fd, &syncobjs[i], 1,
+ stress_fence_deadline(), 0, NULL)) {
+ igt_critical("exec-queue[%d] fence stalled "
+ "under WQ_STRESS, workqueue "
+ "scheduling hang suspected\n", i);
+ atomic_store(&wq_stress_hang_detected, true);
+ igt_thread_fail();
+ goto wq_stress_cleanup;
+ }
+ } else {
+ igt_assert(syncobj_wait(fd, &syncobjs[i], 1,
+ INT64_MAX, 0, NULL));
+ }
+ }
+
+ if (flags & WQ_STRESS) {
+ if (atomic_load(&wq_stress_hang_detected) ||
+ !syncobj_wait(fd, &sync[0].handle, 1,
+ stress_fence_deadline(), 0, NULL)) {
+ igt_critical("bind-chain fence stalled under WQ_STRESS\n");
+ atomic_store(&wq_stress_hang_detected, true);
+ igt_thread_fail();
+ goto wq_stress_cleanup;
+ }
+ } else {
+ igt_assert(syncobj_wait(fd, &sync[0].handle, 1,
+ INT64_MAX, 0, NULL));
+ }
sync[0].flags |= DRM_XE_SYNC_FLAG_SIGNAL;
xe_vm_unbind_async(fd, vm, bind_exec_queues[0], 0, addr,
bo_size, sync, 1);
- igt_assert(syncobj_wait(fd, &sync[0].handle, 1, INT64_MAX, 0, NULL));
+ if (flags & WQ_STRESS) {
+ /*
+ * This is the most critical fence under WQ_STRESS: it covers the
+ * TLB-invalidation completion triggered by xe_vm_unbind_async().
+ * If ttm_bo_delayed_delete() workers are stuck in the workqueue
+ * the TLB flush fence will never signal and we will timeout here.
+ */
+ if (atomic_load(&wq_stress_hang_detected) ||
+ !syncobj_wait(fd, &sync[0].handle, 1,
+ stress_fence_deadline(), 0, NULL)) {
+ igt_critical("unbind/TLB-invalidation fence stalled "
+ "under WQ_STRESS, "
+ "ttm_bo_delayed_delete work item likely stuck\n");
+ atomic_store(&wq_stress_hang_detected, true);
+ igt_thread_fail();
+ goto wq_stress_cleanup;
+ }
+ } else {
+ igt_assert(syncobj_wait(fd, &sync[0].handle, 1,
+ INT64_MAX, 0, NULL));
+ }
for (i = flags & INVALIDATE ? n_execs - 1 : 0;
i < n_execs; i++) {
@@ -713,9 +853,21 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
igt_assert_eq(data[i].data, 0xc0ffee);
}
+wq_stress_cleanup:
syncobj_destroy(fd, sync[0].handle);
for (i = 0; i < n_exec_queues; i++) {
syncobj_destroy(fd, syncobjs[i]);
+ if ((flags & WQ_STRESS) && atomic_load(&wq_stress_hang_detected)) {
+ /*
+ * Under WQ_STRESS, if a hang was detected skip all GPU resource
+ * teardown calls (xe_exec_queue_destroy, xe_vm_destroy, gem_close).
+ * Those ioctls wait for pending GPU work to drain and will hang
+ * indefinitely if the workqueue is stuck. The kernel reclaims all
+ * GPU resources automatically on process exit.
+ */
+ continue;
+ }
+
xe_exec_queue_destroy(fd, exec_queues[i]);
if (bind_exec_queues[i])
xe_exec_queue_destroy(fd, bind_exec_queues[i]);
@@ -723,11 +875,14 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
if (bo) {
munmap(data, bo_size);
- gem_close(fd, bo);
+ if (!(flags & WQ_STRESS) || !atomic_load(&wq_stress_hang_detected))
+ gem_close(fd, bo);
} else if (!(flags & INVALIDATE)) {
free(data);
}
- if (owns_vm)
+
+ if (owns_vm &&
+ (!(flags & WQ_STRESS) || !atomic_load(&wq_stress_hang_detected)))
xe_vm_destroy(fd, vm);
if (owns_fd)
drm_close_driver(fd);
@@ -1529,6 +1684,171 @@ int igt_main()
}
}
+ /**
+ * SUBTEST: threads-wq-stress-rebind-bindexecqueue
+ * Description: Concurrently hammers VM bind/unbind cycles using per-slot
+ * bind exec queues across all engines while a background
+ * process rapidly cycles the unbound workqueue cpumask,
+ * forcing work items to migrate between CPU pools.
+ *
+ * Each bind and unbind fence is waited on with a 5-second
+ * deadline. The test passes if all fences signal within that
+ * window across repeated iterations for up to 30 seconds.
+ * It fails if any fence stalls beyond the deadline, indicating
+ * that GPU work items are no longer being scheduled.
+ */
+ igt_subtest("threads-wq-stress-rebind-bindexecqueue") {
+ char orig_cpumask[64] = {};
+ int cfd, result_pipe[2];
+ pid_t child;
+ uint8_t result_byte;
+
+ struct igt_helper_process cpumask_proc = {};
+ int child_fd;
+ bool hang;
+ uint8_t r;
+
+ /* Needs write access to workqueue cpumask sysfs node (root) */
+ igt_require(access(WQ_CPUMASK_PATH, W_OK) == 0);
+
+ /* Save current cpumask so we can restore it after the test */
+ cfd = open(WQ_CPUMASK_PATH, O_RDONLY);
+ igt_assert_neq(cfd, -1);
+ igt_assert(read(cfd, orig_cpumask, sizeof(orig_cpumask) - 1) > 0);
+ close(cfd);
+ orig_cpumask[strcspn(orig_cpumask, "\n")] = '\0';
+
+ /*
+ * Start the cpumask stressor in the parent so that any unexpected
+ * write failure propagates through igt_stop_helper() and fails
+ * the test - running GPU stress without active cpumask cycling
+ * would make this test meaningless.
+ */
+ cpumask_proc.use_SIGKILL = true;
+ igt_fork_helper(&cpumask_proc)
+ cpumask_stressor_loop();
+
+ /* IPC channel: child writes a 1-byte result; parent reads via poll() */
+ igt_assert_eq(pipe(result_pipe), 0);
+
+ /*
+ * All GPU work is submitted through child_fd (opened inside the
+ * child). The fixture fd held by the parent has no pending GPU
+ * work, so drm_close_driver(fd) in the end fixture will never
+ * block — even if the child's _exit() gets stuck in
+ * dma_resv_wait_timeout() (intr=false, TASK_UNINTERRUPTIBLE).
+ */
+ child = fork();
+ igt_assert_neq(child, -1);
+
+ if (child == 0) {
+ /* ---- child: owns all GPU resources ---- */
+ close(result_pipe[0]);
+
+ child_fd = drm_open_driver(DRIVER_XE);
+
+ atomic_store(&wq_stress_hang_detected, false);
+ igt_until_timeout(30) {
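+ /*
+ * REBIND drives continuous unbind/rebind cycles, BIND_EXEC_QUEUE
+ * routes binds through per-slot bind exec queues, and WQ_STRESS
+ * enables the timed fence waits with early bail-out on a stall.
+ */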
+ threads(child_fd,
+ REBIND | BIND_EXEC_QUEUE | WQ_STRESS);
+ if (atomic_load(&wq_stress_hang_detected))
+ break;
+ }
+
+ /* Restore cpumask from child */
+ cfd = open(WQ_CPUMASK_PATH, O_WRONLY);
+ if (cfd >= 0) {
+ write(cfd, orig_cpumask, strlen(orig_cpumask));
+ close(cfd);
+ }
+
+ hang = atomic_load(&wq_stress_hang_detected);
+ if (hang) {
+ int dc_fd;
+
+ igt_critical("WorkQueue hang detected; dropping "
+ "VM page cache and forcing GT "
+ "reset\n");
+
+ dc_fd = open(DROP_CACHES_PATH,
+ O_WRONLY);
+ if (dc_fd >= 0) {
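+ /* "3" frees the page cache plus dentries and inodes */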
+ write(dc_fd, "3", 1);
+ close(dc_fd);
+ }
+
+ xe_force_gt_reset_all(child_fd);
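+ /* Give the forced reset a moment to settle before reporting */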
+ sleep(1);
+ }
+
+ /*
+ * Write result BEFORE _exit(). When hang == true the
+ * subsequent _exit() triggers do_exit() -> exit_files()
+ * -> fput(child_fd) -> dma_resv_wait_timeout(intr=false)
+ * and blocks in TASK_UNINTERRUPTIBLE. The parent already
+ * has the answer at this point and does not need to wait
+ * for the child to actually exit.
+ */
+ r = hang ? 1 : 0;
+ write(result_pipe[1], &r, 1);
+ close(result_pipe[1]);
+
+ if (!hang)
+ drm_close_driver(child_fd);
+ _exit(hang ? IGT_EXIT_FAILURE : IGT_EXIT_SUCCESS);
+ }
+
+ /* ---- parent ---- */
+ close(result_pipe[1]);
+
+ /*
+ * Wait up to 60s for the child to write its result byte.
+ * The child's stress loop runs for up to 30s, plus cleanup
+ * (GT reset, drop_caches, cpumask restore) — 60s total gives
+ * enough headroom on a healthy kernel while still catching a
+ * child that has silently deadlocked before ever writing.
+ *
+ * poll() timeout -> treat as hang (result_byte = 0xFF).
+ * read() returning 0 (EOF, no byte written) -> same.
+ */
+ {
+ struct pollfd pfd = {
+ .fd = result_pipe[0],
+ .events = POLLIN,
+ };
+ int poll_ret = poll(&pfd, 1, WQ_CHILD_TIMEOUT_MS);
+
+ if (poll_ret <= 0) {
+ /* timeout (0) or poll error (-1) */
+ igt_warn("Timed out waiting for child result "
+ "after 60s - treating as hang\n");
+ result_byte = 0xFF;
+ } else if (read(result_pipe[0], &result_byte, 1) != 1) {
+ result_byte = 0xFF; /* EOF: child crashed before writing */
+ }
+ }
+ close(result_pipe[0]);
+
+ /* Restore cpumask from parent too. */
+ igt_stop_helper(&cpumask_proc);
+ cfd = open(WQ_CPUMASK_PATH, O_WRONLY);
+ if (cfd >= 0) {
+ write(cfd, orig_cpumask, strlen(orig_cpumask));
+ close(cfd);
+ }
+
+ /*
+ * Abort if a hang was detected, as moving forward without doing
+ * so could lead to undefined behavior and further issues in
+ * other tests.
+ */
+ igt_abort_on_f(result_byte != 0,
+ "WQ stress worker detected fence stall "
+ "- workqueue scheduling hang confirmed\n");
+
+ /* Clean path: no hang, reap child normally and continue */
+ waitpid(child, NULL, 0);
+ }
+
igt_fixture()
drm_close_driver(fd);
}
--
2.43.0