From: <vitaly.prosyak@amd.com>
To: <igt-dev@lists.freedesktop.org>
Cc: "Vitaly Prosyak" <vitaly.prosyak@amd.com>,
"Pierre-Eric Pelloux-Prayer" <pierre-eric.pelloux-prayer@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Jesse Zhang" <jesse.zhang@amd.com>
Subject: [PATCH] lib/amdgpu: fix sched_mask leak on igt_assert failure in dispatch tests
Date: Wed, 8 Apr 2026 21:43:18 -0400
Message-ID: <20260409014335.79604-1-vitaly.prosyak@amd.com>
From: Vitaly Prosyak <vitaly.prosyak@amd.com>
amdgpu_dispatch_hang_slow_helper() and amdgpu_gfx_dispatch_test() isolate
individual compute/gfx rings by writing a single-bit mask to the sysfs
sched_mask file. The original mask is restored at the end of the function,
after the ring iteration loop.
When any igt_assert fires inside the loop body (or in a called dispatch
helper such as amdgpu_memcpy_dispatch_hang_slow_test), IGT's failure path
igt_fail() -> exit_subtest() -> siglongjmp() unwinds directly back to the
subtest entry point in igt_subtest_with_dynamic and bypasses the mask
restore code. All but one hardware ring then stays disabled for the
remainder of the test process.
Subsequent subtests then see the drm_gpu_scheduler report ready=false for
the disabled engines, which can lead to NULL-pointer dereferences in the
kernel scheduler when attempting to pick a fence from a disabled ring.
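The bypass can be illustrated outside IGT. The following is a minimal sketch, not IGT code: ring_mask, fake_assert(), isolate_ring_and_test() and run_subtest() are hypothetical stand-ins for the sysfs mask, igt_assert, the dispatch helper and the subtest entry point.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <stdbool.h>

static sigjmp_buf subtest_env;	/* stand-in for the subtest jump buffer */
static long ring_mask = 0xf;	/* stand-in for the sysfs sched_mask value */

/* Stand-in for a failing igt_assert(): jump straight to the entry point. */
static void fake_assert(bool cond)
{
	if (!cond)
		siglongjmp(subtest_env, 1);
}

/* Mirrors the helper's shape: isolate a ring, test, restore at the end. */
static void isolate_ring_and_test(bool dispatch_ok)
{
	long saved = ring_mask;

	ring_mask = 0x1;		/* isolate ring 0 */
	fake_assert(dispatch_ok);	/* on failure, control never returns here */
	ring_mask = saved;		/* restore: skipped when the assert fires */
}

/* Stand-in for the subtest entry point; returns the mask left behind. */
long run_subtest(bool dispatch_ok)
{
	ring_mask = 0xf;
	if (sigsetjmp(subtest_env, 1) == 0)
		isolate_ring_and_test(dispatch_ok);
	return ring_mask;
}
```

In this sketch run_subtest(true) leaves the full mask in place, while run_subtest(false) returns with only the isolated ring's bit set, because the restore line is never reached.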
Fix this with a three-layer safety net:

1. igt_install_exit_handler() -- a file-scoped exit handler registered
   once, which runs automatically during igt_exit() and on fatal signals.
   If the mask is still dirty (not restored), the handler writes the
   original mask back to sysfs.

2. Lazy restore at function entry -- sched_mask_arm() checks whether a
   prior subtest left the mask dirty and restores it before proceeding.
   This protects subsequent subtests within the same test binary.

3. Normal restore path unchanged -- the existing code at function end
   still runs on the happy path and now clears the dirty flag on success.
The static variables (sched_mask_sysfs, sched_mask_saved, sched_mask_dirty)
are file-scoped and shared between both functions, which is safe because
igt_runner executes tests sequentially (no parallel subtests).
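The three layers can be sketched in isolation as follows. This is an illustrative model under stated assumptions, not the patch itself: hw_mask is a hypothetical in-memory stand-in for the sysfs file, atexit() stands in for igt_install_exit_handler() (whose handlers take an int signal argument), and subtest() is a made-up driver. Unlike the patch, which passes in a mask read by the caller, this sketch samples the mask inside mask_arm() after the lazy restore.

```c
#include <stdbool.h>
#include <stdlib.h>

static long hw_mask = 0xf;	/* hypothetical stand-in for sysfs sched_mask */
static long mask_saved;
static bool mask_dirty;
static bool handler_installed;

/* Layer 1: at process exit, restore the mask if it is still dirty. */
static void mask_exit_handler(void)
{
	if (!mask_dirty)
		return;
	mask_dirty = false;
	hw_mask = mask_saved;
}

/* Layer 2: undo a leak from a prior subtest, then save the current mask. */
static void mask_arm(void)
{
	if (mask_dirty)
		mask_exit_handler();
	mask_saved = hw_mask;
	mask_dirty = true;
	if (!handler_installed) {
		atexit(mask_exit_handler);	/* assumption: IGT uses its own API */
		handler_installed = true;
	}
}

/* Layer 3: the normal end-of-function restore also clears the flag. */
static void mask_restore(void)
{
	hw_mask = mask_saved;
	mask_dirty = false;
}

/* Made-up driver: abort_early models siglongjmp() skipping the restore. */
void subtest(bool abort_early)
{
	mask_arm();
	hw_mask = 0x1;		/* isolate a single ring */
	if (abort_early)
		return;
	mask_restore();
}
```

Calling subtest(true) and then subtest(false) shows the lazy restore: the first call leaves the single-bit mask behind with the dirty flag set, and the second call restores the full mask on entry before isolating a ring again.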
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
---
lib/amdgpu/compute_utils/amd_dispatch.c | 53 +++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/lib/amdgpu/compute_utils/amd_dispatch.c b/lib/amdgpu/compute_utils/amd_dispatch.c
index 23c15ef1b..138222cde 100644
--- a/lib/amdgpu/compute_utils/amd_dispatch.c
+++ b/lib/amdgpu/compute_utils/amd_dispatch.c
@@ -12,6 +12,51 @@
#include "amdgpu/amd_ip_blocks.h"
#include "amdgpu/shaders/amd_shaders.h"
+/*
+ * Static state for sched_mask cleanup on abnormal subtest exit.
+ *
+ * When amdgpu_dispatch_hang_slow_helper() or amdgpu_gfx_dispatch_test()
+ * isolate a single compute/gfx ring via sysfs sched_mask, an igt_assert
+ * failure inside the dispatch helpers triggers siglongjmp() back to the
+ * subtest entry point, bypassing the mask restore at the end of the
+ * function. This leaves all other HW rings disabled, which causes
+ * drm_sched to see ready == false and can lead to NULL-pointer
+ * dereferences on subsequent tests.
+ *
+ * Saving the original mask in file-scoped variables and registering an
+ * IGT exit handler guarantees restoration on both normal and abnormal
+ * exit paths (siglongjmp, signals, process exit).
+ */
+static char sched_mask_sysfs[256];
+static long sched_mask_saved;
+static bool sched_mask_dirty;
+
+static void sched_mask_exit_handler(int sig)
+{
+	char cmd[1024];
+
+	if (!sched_mask_dirty)
+		return;
+
+	sched_mask_dirty = false;
+	snprintf(cmd, sizeof(cmd) - 1, "sudo echo 0x%lx > %s",
+		 sched_mask_saved, sched_mask_sysfs);
+	system(cmd);
+}
+
+static void sched_mask_arm(const char *sysfs, long mask)
+{
+	/* If a prior subtest left the mask dirty, restore it first */
+	if (sched_mask_dirty)
+		sched_mask_exit_handler(0);
+
+	strncpy(sched_mask_sysfs, sysfs, sizeof(sched_mask_sysfs) - 1);
+	sched_mask_sysfs[sizeof(sched_mask_sysfs) - 1] = '\0';
+	sched_mask_saved = mask;
+	sched_mask_dirty = true;
+	igt_install_exit_handler(sched_mask_exit_handler);
+}
+
static void
amdgpu_memset_dispatch_test(amdgpu_device_handle device_handle,
uint32_t ip_type, uint32_t priority,
@@ -687,6 +732,9 @@ amdgpu_dispatch_hang_slow_helper(amdgpu_device_handle device_handle,
sched_mask = 1;
}
+ if (sched_mask > 1)
+ sched_mask_arm(sysfs, sched_mask);
+
for (ring_id = 0; (0x1 << ring_id) <= sched_mask; ring_id++) {
/* check sched is ready is on the ring. */
if (!((1 << ring_id) & sched_mask))
@@ -733,6 +781,7 @@ amdgpu_dispatch_hang_slow_helper(amdgpu_device_handle device_handle,
snprintf(cmd, sizeof(cmd) - 1, "sudo echo 0x%lx > %s",sched_mask, sysfs);
r = system(cmd);
igt_assert_eq(r, 0);
+ sched_mask_dirty = false;
}
}
@@ -769,6 +818,9 @@ void amdgpu_gfx_dispatch_test(amdgpu_device_handle device_handle, uint32_t ip_ty
}
}
+ if (sched_mask > 1)
+ sched_mask_arm(sysfs, sched_mask);
+
for (ring_id = 0; (0x1 << ring_id) <= sched_mask; ring_id++) {
/* check sched is ready is on the ring. */
if (!((1 << ring_id) & sched_mask))
@@ -811,6 +863,7 @@ void amdgpu_gfx_dispatch_test(amdgpu_device_handle device_handle, uint32_t ip_ty
snprintf(cmd, sizeof(cmd) - 1, "sudo echo 0x%lx > %s",sched_mask, sysfs);
r = system(cmd);
igt_assert_eq(r, 0);
+ sched_mask_dirty = false;
}
}
--
2.53.0
Thread overview: 6+ messages
2026-04-09 1:43 vitaly.prosyak [this message]
2026-04-09 1:57 ` [PATCH] lib/amdgpu: fix sched_mask leak on igt_assert failure in dispatch tests Zhang, Jesse(Jie)
2026-04-09 23:03 ` ✓ Xe.CI.BAT: success for " Patchwork
2026-04-09 23:18 ` ✓ i915.CI.BAT: " Patchwork
2026-04-10 0:56 ` ✓ Xe.CI.FULL: " Patchwork
2026-04-10 16:24 ` ✗ i915.CI.Full: failure " Patchwork