From: <vitaly.prosyak@amd.com>
To: <igt-dev@lists.freedesktop.org>
Cc: "Vitaly Prosyak" <vitaly.prosyak@amd.com>,
	"Pierre-Eric Pelloux-Prayer" <pierre-eric.pelloux-prayer@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Jesse Zhang" <jesse.zhang@amd.com>
Subject: [PATCH] lib/amdgpu: fix sched_mask leak on igt_assert failure in dispatch tests
Date: Wed, 8 Apr 2026 21:43:18 -0400
Message-ID: <20260409014335.79604-1-vitaly.prosyak@amd.com>

From: Vitaly Prosyak <vitaly.prosyak@amd.com>

amdgpu_dispatch_hang_slow_helper() and amdgpu_gfx_dispatch_test() isolate
individual compute/gfx rings by writing a single-bit mask to the sysfs
sched_mask file.  The original mask is restored at the end of the function,
after the ring iteration loop.
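
As an illustration of the isolation step described above, the following
sketch enumerates the single-bit masks that would be written for each
enabled ring.  The helper name ring_isolation_masks() is hypothetical and
is not part of amd_dispatch.c; only the loop condition mirrors the one in
the diff, with an added bound on the shift to keep it well defined.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in amd_dispatch.c): collect the single-bit
 * masks for every ring enabled in sched_mask.  Returns the number of
 * masks stored in out[], at most max_out.
 */
static size_t ring_isolation_masks(long sched_mask, long *out, size_t max_out)
{
	const int max_shift = (int)(sizeof(long) * 8 - 1);
	size_t n = 0;
	int ring_id;

	for (ring_id = 0;
	     ring_id < max_shift && (0x1L << ring_id) <= sched_mask;
	     ring_id++) {
		/* Skip rings that are not enabled in the original mask. */
		if (!((0x1L << ring_id) & sched_mask))
			continue;
		if (n < max_out)
			out[n++] = 0x1L << ring_id;
	}
	return n;
}
```

For a mask of 0xb (rings 0, 1 and 3 enabled) this yields 0x1, 0x2 and 0x8,
one isolated-ring mask per loop iteration that does real work.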

When an igt_assert fires inside the loop body (or in a called dispatch
helper such as amdgpu_memcpy_dispatch_hang_slow_test), IGT's failure path
igt_fail() -> exit_subtest() -> siglongjmp() jumps directly back to the
subtest entry point in igt_subtest_with_dynamic and bypasses the mask
restore code.  This leaves every hardware ring except the isolated one
disabled for the remainder of the test process.

Subsequent subtests then see the drm_gpu_scheduler report ready=false for
the disabled engines, which can lead to NULL-pointer dereferences in the
kernel scheduler when attempting to pick a fence from a disabled ring.
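
The bypass can be reproduced outside IGT with a minimal sketch.  Every
name below is a stand-in invented for illustration: fake_assert_fail()
plays the role of a failing igt_assert, and the mask_restored flag plays
the role of the sysfs write at the end of the function.  It models only
the control flow, not IGT's actual implementation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

static sigjmp_buf subtest_jmp;	/* stand-in for the subtest jump buffer */
static bool mask_restored;	/* stand-in for the sysfs restore */

/* Stand-in for a failing igt_assert: jumps straight out of the subtest. */
static void fake_assert_fail(void)
{
	siglongjmp(subtest_jmp, 1);
}

/* Models a helper that narrows the mask, runs work, then restores it. */
static void fake_dispatch_helper(bool fail)
{
	mask_restored = false;		/* mask narrowed to a single ring */
	if (fail)
		fake_assert_fail();	/* does not return */
	mask_restored = true;		/* restore at end of function */
}

/* Returns true if the restore code ran, false if it was bypassed. */
static bool run_fake_subtest(bool fail)
{
	if (sigsetjmp(subtest_jmp, 1) == 0)
		fake_dispatch_helper(fail);
	return mask_restored;
}
```

Calling run_fake_subtest(false) reaches the restore, while
run_fake_subtest(true) returns with mask_restored still false, which is
the leak this patch addresses.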

Fix this with a three-layer safety net:

1. igt_install_exit_handler() -- a file-scoped exit handler registered
   once, which runs automatically during igt_exit() and on fatal signals.
   If the mask is still dirty (not restored), the handler writes the
   original mask back to sysfs.

2. Lazy restore at function entry -- sched_mask_arm() checks whether a
   prior subtest left the mask dirty and restores it before proceeding.
   This protects subsequent subtests within the same test binary.

3. Normal restore path unchanged -- the existing code at function end
   still runs on the happy path and now clears the dirty flag on success.
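
The three layers can be sketched in isolation as follows.  This is an
illustration under stated assumptions, not the code in the diff: the names
write_mask(), restore_handler(), mask_arm() and mask_restore() are
hypothetical, the handler is not registered with IGT here, and the sketch
swaps the patch's system("sudo echo ... > file") call for a direct
open()/write() so that it can be exercised against an ordinary file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static char saved_path[256];
static long saved_mask;
static bool mask_dirty;

/* Write "0x<mask>" to path with open()/write(); returns 0 on success. */
static int write_mask(const char *path, long mask)
{
	char buf[32];
	int len, fd;
	ssize_t w;

	len = snprintf(buf, sizeof(buf), "0x%lx", mask);
	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;
	w = write(fd, buf, len);
	close(fd);
	return w == len ? 0 : -1;
}

/* Layer 1: cleanup handler; a no-op unless the mask is still dirty. */
static void restore_handler(int sig)
{
	(void)sig;
	if (!mask_dirty)
		return;
	mask_dirty = false;
	write_mask(saved_path, saved_mask);
}

/* Layer 2: lazily restore a leaked mask, then record the new state. */
static void mask_arm(const char *path, long mask)
{
	if (mask_dirty)
		restore_handler(0);
	snprintf(saved_path, sizeof(saved_path), "%s", path);
	saved_mask = mask;
	mask_dirty = true;
}

/* Layer 3: normal restore path; clears the dirty flag on success. */
static int mask_restore(void)
{
	int r = write_mask(saved_path, saved_mask);

	if (r == 0)
		mask_dirty = false;
	return r;
}
```

In this sketch a caller would invoke mask_arm() before narrowing the mask
and mask_restore() on the happy path.  If an assert skips mask_restore(),
the flag stays set and either the next mask_arm() or the handler writes
the saved mask back.  Note that with "sudo echo ... > file" the redirect
is performed by the shell that system() spawns rather than by sudo, so a
direct write() behaves the same way when the test already runs with
sufficient privileges.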

The static variables (sched_mask_sysfs, sched_mask_saved, sched_mask_dirty)
are file-scoped and shared between both functions, which is safe because
igt_runner executes tests sequentially (no parallel subtests).

Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
---
 lib/amdgpu/compute_utils/amd_dispatch.c | 53 +++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/lib/amdgpu/compute_utils/amd_dispatch.c b/lib/amdgpu/compute_utils/amd_dispatch.c
index 23c15ef1b..138222cde 100644
--- a/lib/amdgpu/compute_utils/amd_dispatch.c
+++ b/lib/amdgpu/compute_utils/amd_dispatch.c
@@ -12,6 +12,51 @@
 #include "amdgpu/amd_ip_blocks.h"
 #include "amdgpu/shaders/amd_shaders.h"
 
+/*
+ * Static state for sched_mask cleanup on abnormal subtest exit.
+ *
+ * When amdgpu_dispatch_hang_slow_helper() or amdgpu_gfx_dispatch_test()
+ * isolate a single compute/gfx ring via sysfs sched_mask, an igt_assert
+ * failure inside the dispatch helpers triggers siglongjmp() back to the
+ * subtest entry point, bypassing the mask restore at the end of the
+ * function.  This leaves all other HW rings disabled, which causes
+ * drm_sched to see ready == false and can lead to NULL-pointer
+ * dereferences on subsequent tests.
+ *
+ * Saving the original mask in file-scoped variables and registering an
+ * IGT exit handler guarantees restoration on both normal and abnormal
+ * exit paths (siglongjmp, signals, process exit).
+ */
+static char sched_mask_sysfs[256];
+static long sched_mask_saved;
+static bool sched_mask_dirty;
+
+static void sched_mask_exit_handler(int sig)
+{
+	char cmd[1024];
+
+	if (!sched_mask_dirty)
+		return;
+
+	sched_mask_dirty = false;
+	snprintf(cmd, sizeof(cmd) - 1, "sudo echo  0x%lx > %s",
+		 sched_mask_saved, sched_mask_sysfs);
+	system(cmd);
+}
+
+static void sched_mask_arm(const char *sysfs, long mask)
+{
+	/* If a prior subtest left the mask dirty, restore it first */
+	if (sched_mask_dirty)
+		sched_mask_exit_handler(0);
+
+	strncpy(sched_mask_sysfs, sysfs, sizeof(sched_mask_sysfs) - 1);
+	sched_mask_sysfs[sizeof(sched_mask_sysfs) - 1] = '\0';
+	sched_mask_saved = mask;
+	sched_mask_dirty = true;
+	igt_install_exit_handler(sched_mask_exit_handler);
+}
+
 static void
 amdgpu_memset_dispatch_test(amdgpu_device_handle device_handle,
 			    uint32_t ip_type, uint32_t priority,
@@ -687,6 +732,9 @@ amdgpu_dispatch_hang_slow_helper(amdgpu_device_handle device_handle,
 		sched_mask = 1;
 	}
 
+	if (sched_mask > 1)
+		sched_mask_arm(sysfs, sched_mask);
+
 	for (ring_id = 0; (0x1 << ring_id) <= sched_mask; ring_id++) {
 		/* check sched is ready is on the ring. */
 		if (!((1 << ring_id) & sched_mask))
@@ -733,6 +781,7 @@ amdgpu_dispatch_hang_slow_helper(amdgpu_device_handle device_handle,
 		snprintf(cmd, sizeof(cmd) - 1, "sudo echo  0x%lx > %s",sched_mask, sysfs);
 		r = system(cmd);
 		igt_assert_eq(r, 0);
+		sched_mask_dirty = false;
 	}
 }
 
@@ -769,6 +818,9 @@ void amdgpu_gfx_dispatch_test(amdgpu_device_handle device_handle, uint32_t ip_ty
 		}
 	}
 
+	if (sched_mask > 1)
+		sched_mask_arm(sysfs, sched_mask);
+
 	for (ring_id = 0; (0x1 << ring_id) <= sched_mask; ring_id++) {
 		/* check sched is ready is on the ring. */
 		if (!((1 << ring_id) & sched_mask))
@@ -811,6 +863,7 @@ void amdgpu_gfx_dispatch_test(amdgpu_device_handle device_handle, uint32_t ip_ty
 		snprintf(cmd, sizeof(cmd) - 1, "sudo echo  0x%lx > %s",sched_mask, sysfs);
 		r = system(cmd);
 		igt_assert_eq(r, 0);
+		sched_mask_dirty = false;
 	}
 }
 
-- 
2.53.0


Thread overview: 6+ messages
2026-04-09  1:43 vitaly.prosyak [this message]
2026-04-09  1:57 ` [PATCH] lib/amdgpu: fix sched_mask leak on igt_assert failure in dispatch tests Zhang, Jesse(Jie)
2026-04-09 23:03 ` ✓ Xe.CI.BAT: success for " Patchwork
2026-04-09 23:18 ` ✓ i915.CI.BAT: " Patchwork
2026-04-10  0:56 ` ✓ Xe.CI.FULL: " Patchwork
2026-04-10 16:24 ` ✗ i915.CI.Full: failure " Patchwork
