* [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING
@ 2025-05-03 20:59 Maíra Canal
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
` (7 more replies)
0 siblings, 8 replies; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:
1. The GPU exposes some mechanism that ensures the GPU is still making
progress. By checking this mechanism, we can safely skip the reset,
rearm the timeout, and allow the job to continue running until
completion. This is the case for v3d and Etnaviv.
2. TDR has fired before the IRQ that signals the fence. Consequently,
the job actually finishes, but it triggers a timeout before signaling
the completion fence.
These two scenarios are problematic because we remove the job from the
`sched->pending_list` before calling `sched->ops->timedout_job()`. This
means that when the job finally signals completion (e.g. in the IRQ
handler), the scheduler won't call `sched->ops->free_job()`. As a result,
the job and its resources won't be freed, leading to a memory leak.
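In rough terms, the relevant part of drm_sched_job_timedout() looks like
this (a simplified sketch of the current flow, not the exact code):

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job) {
		/* The job leaves pending_list before the driver hook runs. */
		list_del_init(&job->list);
		spin_unlock(&sched->job_list_lock);

		status = job->sched->ops->timedout_job(job);

		/*
		 * Unless the driver performs reset recovery with
		 * drm_sched_stop()/drm_sched_start() (which re-queues or
		 * frees the pending jobs), nothing will ever call
		 * free_job() for this job once its fence signals.
		 */
	}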
For v3d specifically, we have observed that these memory leaks can be
significant in certain scenarios, as reported by users in [1][2]. To
address this situation, I submitted a patch similar to commit 704d3d60fec4
("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
This patch has already landed in drm-misc-fixes and successfully resolved
the users' issues.
However, as I mentioned in [3], exposing the scheduler's internals to the
drivers isn't ideal, and I believe this specific situation can be addressed
within the DRM scheduler framework.
This series aims to resolve this issue by adding a new DRM sched status
that allows a driver to skip the reset. This new status will indicate that
the job should be reinserted into the pending list, and the driver will
still signal its completion.
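For example, with this status a driver's timedout_job() hook can look
roughly like the sketch below (the foo_* names are made up for
illustration; patches 5-8 show the real driver changes):

static enum drm_gpu_sched_stat foo_timedout_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = to_foo_job(sched_job);

	/* The fence already signaled: the timeout raced with the IRQ. */
	if (dma_fence_is_signaled(job->done_fence))
		return DRM_GPU_SCHED_STAT_RUNNING;

	/* The hardware is still making progress: skip the reset and re-arm. */
	if (foo_gpu_is_making_progress(job))
		return DRM_GPU_SCHED_STAT_RUNNING;

	/* Otherwise, perform the usual reset recovery. */
	drm_sched_stop(sched_job->sched, sched_job);
	foo_gpu_reset(job);
	drm_sched_start(sched_job->sched, 0);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}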
The series can be divided into three parts:
* Patch 1: Implementation of the new status in the DRM scheduler.
* Patches 2-4: Some fixes to the DRM scheduler KUnit tests and the
addition of a test for the new status.
* Patches 5-8: Use of the new status in four different drivers.
[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
[2] https://github.com/raspberrypi/linux/issues/6817
[3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mcanal@igalia.com/T/
Best Regards,
- Maíra
---
Maíra Canal (8):
drm/sched: Allow drivers to skip the reset and keep on running
drm/sched: Always free the job after the timeout
drm/sched: Reduce scheduler's timeout for timeout tests
drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
drm/etnaviv: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
drivers/gpu/drm/etnaviv/etnaviv_sched.c | 12 ++---
drivers/gpu/drm/panfrost/panfrost_job.c | 8 ++--
drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++
drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 13 ++++++
drivers/gpu/drm/scheduler/tests/tests_basic.c | 57 ++++++++++++++++++++++--
drivers/gpu/drm/v3d/v3d_sched.c | 4 +-
drivers/gpu/drm/xe/xe_guc_submit.c | 8 +---
include/drm/gpu_scheduler.h | 2 +
8 files changed, 94 insertions(+), 24 deletions(-)
---
base-commit: 760e296124ef3b6e14cd1d940f2a01c5ed7c0dac
change-id: 20250502-sched-skip-reset-bf7c163233da
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-06 2:41 ` Matthew Brost
` (3 more replies)
2025-05-03 20:59 ` [PATCH 2/8] drm/sched: Always free the job after the timeout Maíra Canal
` (6 subsequent siblings)
7 siblings, 4 replies; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:
1. The GPU exposes some mechanism that ensures the GPU is still making
progress. By checking this mechanism, we can safely skip the reset,
rearm the timeout, and allow the job to continue running until
completion. This is the case for v3d and Etnaviv.
2. TDR has fired before the IRQ that signals the fence. Consequently,
the job actually finishes, but it triggers a timeout before signaling
the completion fence.
These two scenarios are problematic because we remove the job from the
`sched->pending_list` before calling `sched->ops->timedout_job()`. This
means that when the job finally signals completion (e.g. in the IRQ
handler), the scheduler won't call `sched->ops->free_job()`. As a result,
the job and its resources won't be freed, leading to a memory leak.
To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
driver to skip the reset. This new status will indicate that the job
should be reinserted into the pending list, and the driver will still
signal its completion.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
include/drm/gpu_scheduler.h | 2 ++
2 files changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
job->sched->ops->free_job(job);
sched->free_guilty = false;
}
+
+ /*
+ * If the driver indicated that the GPU is still running and wants to skip
+ * the reset, reinsert the job back into the pending list and realarm the
+ * timeout.
+ */
+ if (status == DRM_GPU_SCHED_STAT_RUNNING) {
+ spin_lock(&sched->job_list_lock);
+ list_add(&job->list, &sched->pending_list);
+ spin_unlock(&sched->job_list_lock);
+ }
} else {
spin_unlock(&sched->job_list_lock);
}
@@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
* This function is typically used for reset recovery (see the docu of
* drm_sched_backend_ops.timedout_job() for details). Do not call it for
* scheduler teardown, i.e., before calling drm_sched_fini().
+ *
+ * As it's used for reset recovery, drm_sched_stop() shouldn't be called
+ * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
*/
void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
{
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -389,11 +389,13 @@ struct drm_sched_job {
* @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
* @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
* @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
+ * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
*/
enum drm_gpu_sched_stat {
DRM_GPU_SCHED_STAT_NONE,
DRM_GPU_SCHED_STAT_NOMINAL,
DRM_GPU_SCHED_STAT_ENODEV,
+ DRM_GPU_SCHED_STAT_RUNNING,
};
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 2/8] drm/sched: Always free the job after the timeout
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-06 11:49 ` Tvrtko Ursulin
2025-05-03 20:59 ` [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests Maíra Canal
` (5 subsequent siblings)
7 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Currently, if we add the assertions presented in this commit to the mock
scheduler, we will see the following output:
[15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
[15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
[15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/gpu/drm/scheduler/tests/tests_basic.c:246
[15:47:08] Expected list_empty(&sched->job_list) to be true, but is false
[15:47:08] [FAILED] drm_sched_basic_timeout
[15:47:08] # module: drm_sched_tests
This occurs because `mock_sched_timedout_job()` doesn't properly handle
the hang. From the DRM sched documentation, `drm_sched_stop()` and
`drm_sched_start()` are typically used for reset recovery. If these
functions are not used, the offending job won't be freed and should be
freed by the caller.
Currently, the mock scheduler doesn't use the functions provided by the
API, nor does it handle the freeing of the job. As a result, the job isn't
removed from the job list.
This commit mocks a GPU reset by stopping the scheduler affected by the
reset, waiting a few hundred microseconds to mimic a hardware reset, and
then restarting the affected scheduler.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
drivers/gpu/drm/scheduler/tests/tests_basic.c | 3 +++
2 files changed, 13 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index f999c8859cf7adb8f06fc8a37969656dd3249fa7..e9af202d84bd55ea5cc048215e39f5407bc84458 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2025 Valve Corporation */
+#include <linux/delay.h>
+
#include "sched_tests.h"
/*
@@ -203,10 +205,18 @@ static struct dma_fence *mock_sched_run_job(struct drm_sched_job *sched_job)
static enum drm_gpu_sched_stat
mock_sched_timedout_job(struct drm_sched_job *sched_job)
{
+ struct drm_mock_scheduler *sched =
+ drm_sched_to_mock_sched(sched_job->sched);
struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;
+ drm_sched_stop(&sched->base, &job->base);
+
+ usleep_range(200, 500);
+
+ drm_sched_start(&sched->base, 0);
+
return DRM_GPU_SCHED_STAT_NOMINAL;
}
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit *test)
job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
DRM_MOCK_SCHED_JOB_TIMEDOUT);
+ KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
+ KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
+
drm_mock_sched_entity_free(entity);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-05-03 20:59 ` [PATCH 2/8] drm/sched: Always free the job after the timeout Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-06 12:03 ` Tvrtko Ursulin
2025-05-03 20:59 ` [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
` (4 subsequent siblings)
7 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
As more KUnit tests are introduced to evaluate the basic capabilities of
the `timedout_job()` hook, the test suite will continue to increase in
duration. To reduce the overall running time of the test suite, decrease
the scheduler's timeout for the timeout tests.
Before this commit:
[15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s building, 5.229s running
After this commit:
[15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s building, 4.037s running
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/scheduler/tests/tests_basic.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 8f960f0fd31d0af7873f410ceba2d636f58a5474..00c691cb3c306f609684f554f17fcb54ba74cb95 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -5,6 +5,8 @@
#include "sched_tests.h"
+#define MOCK_TIMEOUT (HZ / 5)
+
/*
* DRM scheduler basic tests should check the basic functional correctness of
* the scheduler, including some very light smoke testing. More targeted tests,
@@ -28,7 +30,7 @@ static void drm_sched_basic_exit(struct kunit *test)
static int drm_sched_timeout_init(struct kunit *test)
{
- test->priv = drm_mock_sched_new(test, HZ);
+ test->priv = drm_mock_sched_new(test, MOCK_TIMEOUT);
return 0;
}
@@ -224,17 +226,17 @@ static void drm_sched_basic_timeout(struct kunit *test)
drm_mock_sched_job_submit(job);
- done = drm_mock_sched_job_wait_scheduled(job, HZ);
+ done = drm_mock_sched_job_wait_scheduled(job, MOCK_TIMEOUT);
KUNIT_ASSERT_TRUE(test, done);
- done = drm_mock_sched_job_wait_finished(job, HZ / 2);
+ done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT / 2);
KUNIT_ASSERT_FALSE(test, done);
KUNIT_ASSERT_EQ(test,
job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
0);
- done = drm_mock_sched_job_wait_finished(job, HZ);
+ done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
KUNIT_ASSERT_FALSE(test, done);
KUNIT_ASSERT_EQ(test,
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
` (2 preceding siblings ...)
2025-05-03 20:59 ` [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-06 13:58 ` Tvrtko Ursulin
2025-05-03 20:59 ` [PATCH 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset Maíra Canal
` (3 subsequent siblings)
7 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Add a test to submit a single job against a scheduler with the timeout
configured and verify that if the job is still running, the timeout
handler will skip the reset and allow the job to complete.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 3 ++
drivers/gpu/drm/scheduler/tests/tests_basic.c | 44 ++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index e9af202d84bd55ea5cc048215e39f5407bc84458..9d594cb5bf567be25e018ddbcd28b70a7e994260 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -211,6 +211,9 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;
+ if (job->finish_at && ktime_before(ktime_get(), job->finish_at))
+ return DRM_GPU_SCHED_STAT_RUNNING;
+
drm_sched_stop(&sched->base, &job->base);
usleep_range(200, 500);
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 00c691cb3c306f609684f554f17fcb54ba74cb95..669a211b216ee298544ac237abb866077d856586 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -249,8 +249,52 @@ static void drm_sched_basic_timeout(struct kunit *test)
drm_mock_sched_entity_free(entity);
}
+static void drm_sched_skip_reset(struct kunit *test)
+{
+ struct drm_mock_scheduler *sched = test->priv;
+ struct drm_mock_sched_entity *entity;
+ struct drm_mock_sched_job *job;
+ bool done;
+
+ /*
+ * Submit a single job against a scheduler with the timeout configured
+ * and verify that if the job is still running, the timeout handler
+ * will skip the reset and allow the job to complete.
+ */
+
+ entity = drm_mock_sched_entity_new(test,
+ DRM_SCHED_PRIORITY_NORMAL,
+ sched);
+ job = drm_mock_sched_job_new(test, entity);
+
+ drm_mock_sched_job_set_duration_us(job, jiffies_to_usecs(2 * MOCK_TIMEOUT));
+ drm_mock_sched_job_submit(job);
+
+ done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
+ KUNIT_ASSERT_FALSE(test, done);
+
+ KUNIT_ASSERT_EQ(test,
+ job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
+ DRM_MOCK_SCHED_JOB_TIMEDOUT);
+
+ KUNIT_ASSERT_FALSE(test, list_empty(&sched->job_list));
+
+ done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
+ KUNIT_ASSERT_TRUE(test, done);
+
+ KUNIT_ASSERT_EQ(test,
+ job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
+ DRM_MOCK_SCHED_JOB_TIMEDOUT);
+
+ KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
+ KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
+
+ drm_mock_sched_entity_free(entity);
+}
+
static struct kunit_case drm_sched_timeout_tests[] = {
KUNIT_CASE(drm_sched_basic_timeout),
+ KUNIT_CASE(drm_sched_skip_reset),
{}
};
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
` (3 preceding siblings ...)
2025-05-03 20:59 ` [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-03 20:59 ` [PATCH 6/8] drm/etnaviv: " Maíra Canal
` (2 subsequent siblings)
7 siblings, 0 replies; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
When a CL/CSD job times out, we check if the GPU has made any progress
since the last timeout. If so, instead of resetting the hardware, we skip
the reset and allow the timer to be rearmed. This gives long-running jobs
a chance to complete.
Use the DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm
the timer.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/v3d/v3d_sched.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index b3be08b0ca9188564f9fb6aa32694940a5fadc9d..51770b6686d0befffa3e87c290bbdc1a12a19ad5 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -751,7 +751,7 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
*timedout_ctca = ctca;
*timedout_ctra = ctra;
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RUNNING;
}
return v3d_gpu_reset_for_timeout(v3d, sched_job);
@@ -795,7 +795,7 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
*/
if (job->timedout_batches != batches) {
job->timedout_batches = batches;
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RUNNING;
}
return v3d_gpu_reset_for_timeout(v3d, sched_job);
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
` (4 preceding siblings ...)
2025-05-03 20:59 ` [PATCH 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-03 20:59 ` [PATCH 7/8] drm/xe: " Maíra Canal
2025-05-03 20:59 ` [PATCH 8/8] drm/panfrost: " Maíra Canal
7 siblings, 0 replies; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Etnaviv can skip a hardware reset in two situations:
1. TDR has fired before the IRQ and the timeout is spurious.
2. The GPU is still making progress on the front-end and we can give
the job a chance to complete.
Instead of relying on the scheduler internals, use the
DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm the timer.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/etnaviv/etnaviv_sched.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 76a3a3e517d8d9f654fb6b9e98e72910795cfc7a..b87ffdb4136aebade736d78b3677de2f21d52ebc 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -40,11 +40,11 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
int change;
/*
- * If the GPU managed to complete this jobs fence, the timout is
- * spurious. Bail out.
+ * If the GPU managed to complete this jobs fence, TDR has fired before
+ * IRQ and the timeout is spurious. Bail out.
*/
if (dma_fence_is_signaled(submit->out_fence))
- goto out_no_timeout;
+ return DRM_GPU_SCHED_STAT_RUNNING;
/*
* If the GPU is still making forward progress on the front-end (which
@@ -70,7 +70,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
gpu->hangcheck_dma_addr = dma_addr;
gpu->hangcheck_primid = primid;
gpu->hangcheck_fence = gpu->completed_fence;
- goto out_no_timeout;
+ return DRM_GPU_SCHED_STAT_RUNNING;
}
/* block scheduler */
@@ -87,10 +87,6 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
drm_sched_start(&gpu->sched, 0);
return DRM_GPU_SCHED_STAT_NOMINAL;
-
-out_no_timeout:
- list_add(&sched_job->list, &sched_job->sched->pending_list);
- return DRM_GPU_SCHED_STAT_NOMINAL;
}
static void etnaviv_sched_free_job(struct drm_sched_job *sched_job)
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
` (5 preceding siblings ...)
2025-05-03 20:59 ` [PATCH 6/8] drm/etnaviv: " Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-06 4:25 ` Matthew Brost
2025-05-03 20:59 ` [PATCH 8/8] drm/panfrost: " Maíra Canal
7 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Xe can skip the reset if TDR has fired before the free job worker. Instead
of using the scheduler internals to add the job to the pending list, use
the DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm the
timer.
Note that there is no need to restart submission if it hasn't been
stopped.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 31bc2022bfc2d80f0ef54726dfeb8d7f8e6b32c8..4c40d3921d4a5e190d3413736a68c6e7295223dd 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1058,12 +1058,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
* list so job can be freed and kick scheduler ensuring free job is not
* lost.
*/
- if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
- xe_sched_add_pending_job(sched, job);
- xe_sched_submission_start(sched);
-
- return DRM_GPU_SCHED_STAT_NOMINAL;
- }
+ if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
+ return DRM_GPU_SCHED_STAT_RUNNING;
/* Kill the run_job entry point */
xe_sched_submission_stop(sched);
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 8/8] drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
` (6 preceding siblings ...)
2025-05-03 20:59 ` [PATCH 7/8] drm/xe: " Maíra Canal
@ 2025-05-03 20:59 ` Maíra Canal
2025-05-08 15:04 ` Steven Price
7 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-03 20:59 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Panfrost can skip the reset if TDR has fired before the IRQ handler.
Currently, since Panfrost doesn't take any action in these scenarios, the
job is leaked, as `free_job()` won't be called.
To avoid such leaks, use the DRM_GPU_SCHED_STAT_RUNNING status to skip the
reset and rearm the timer.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/panfrost/panfrost_job.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 5657106c2f7d0a0ca6162850767f58f3200cce13..2948d5c02115544a0e0babffd850f1506152849d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -751,11 +751,11 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
int js = panfrost_job_get_slot(job);
/*
- * If the GPU managed to complete this jobs fence, the timeout is
- * spurious. Bail out.
+ * If the GPU managed to complete this jobs fence, TDR has fired before
+ * IRQ and the timeout is spurious. Bail out.
*/
if (dma_fence_is_signaled(job->done_fence))
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RUNNING;
/*
* Panfrost IRQ handler may take a long time to process an interrupt
@@ -770,7 +770,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
if (dma_fence_is_signaled(job->done_fence)) {
dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RUNNING;
}
dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
--
2.49.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
@ 2025-05-06 2:41 ` Matthew Brost
2025-05-06 14:32 ` Matthew Brost
2025-05-06 11:32 ` Tvrtko Ursulin
` (2 subsequent siblings)
3 siblings, 1 reply; 34+ messages in thread
From: Matthew Brost @ 2025-05-06 2:41 UTC (permalink / raw)
To: Maíra Canal
Cc: Danilo Krummrich, Philipp Stanner, Christian König,
Tvrtko Ursulin, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, kernel-dev, dri-devel, etnaviv, intel-xe
On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job may still be running, and there may be no valid reason to
> reset the hardware. This can occur in two situations:
>
> 1. The GPU exposes some mechanism that ensures the GPU is still making
> progress. By checking this mechanism, we can safely skip the reset,
> rearm the timeout, and allow the job to continue running until
> completion. This is the case for v3d and Etnaviv.
> 2. TDR has fired before the IRQ that signals the fence. Consequently,
> the job actually finishes, but it triggers a timeout before signaling
> the completion fence.
>
We have both of these cases in Xe too. We implement the requeuing in Xe
via a driver-side function, xe_sched_add_pending_job, but this looks
better and we will make use of this.
> These two scenarios are problematic because we remove the job from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`. This
> means that when the job finally signals completion (e.g. in the IRQ
> handler), the scheduler won't call `sched->ops->free_job()`. As a result,
> the job and its resources won't be freed, leading to a memory leak.
>
> To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
> driver to skip the reset. This new status will indicate that the job
> should be reinserted into the pending list, and the driver will still
> signal its completion.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> include/drm/gpu_scheduler.h | 2 ++
> 2 files changed, 16 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + /*
> + * If the driver indicated that the GPU is still running and wants to skip
> + * the reset, reinsert the job back into the pending list and realarm the
> + * timeout.
> + */
> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> + spin_unlock(&sched->job_list_lock);
> + }
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> * This function is typically used for reset recovery (see the docu of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's used for reset recovery, drm_sched_stop() shouldn't be called
> + * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> {
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -389,11 +389,13 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_NOMINAL,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_RUNNING,
> };
>
> /**
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
2025-05-03 20:59 ` [PATCH 7/8] drm/xe: " Maíra Canal
@ 2025-05-06 4:25 ` Matthew Brost
0 siblings, 0 replies; 34+ messages in thread
From: Matthew Brost @ 2025-05-06 4:25 UTC (permalink / raw)
To: Maíra Canal
Cc: Danilo Krummrich, Philipp Stanner, Christian König,
Tvrtko Ursulin, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, kernel-dev, dri-devel, etnaviv, intel-xe
On Sat, May 03, 2025 at 05:59:58PM -0300, Maíra Canal wrote:
> Xe can skip the reset if TDR has fired before the free job worker. Instead
> of using the scheduler internals to add the job to the pending list, use
> the DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm the
> timer.
>
> Note that there is no need to restart submission if it hasn't been
> stopped.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 31bc2022bfc2d80f0ef54726dfeb8d7f8e6b32c8..4c40d3921d4a5e190d3413736a68c6e7295223dd 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1058,12 +1058,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> * list so job can be freed and kick scheduler ensuring free job is not
> * lost.
> */
> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> - xe_sched_add_pending_job(sched, job);
> - xe_sched_submission_start(sched);
> -
> - return DRM_GPU_SCHED_STAT_NOMINAL;
> - }
> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> + return DRM_GPU_SCHED_STAT_RUNNING;
There are a couple of other calls to xe_sched_add_pending_job in this
function which I believe can be dropped in favor of returning
DRM_GPU_SCHED_STAT_RUNNING too.
Matt
>
> /* Kill the run_job entry point */
> xe_sched_submission_stop(sched);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-05-06 2:41 ` Matthew Brost
@ 2025-05-06 11:32 ` Tvrtko Ursulin
2025-05-07 12:33 ` Maíra Canal
2025-05-12 11:04 ` Philipp Stanner
2025-05-13 7:26 ` Philipp Stanner
3 siblings, 1 reply; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 11:32 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 03/05/2025 21:59, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job may still be running, and there may be no valid reason to
> reset the hardware. This can occur in two situations:
>
> 1. The GPU exposes some mechanism that ensures the GPU is still making
> progress. By checking this mechanism, we can safely skip the reset,
> rearm the timeout, and allow the job to continue running until
> completion. This is the case for v3d and Etnaviv.
> 2. TDR has fired before the IRQ that signals the fence. Consequently,
> the job actually finishes, but it triggers a timeout before signaling
> the completion fence.
>
> These two scenarios are problematic because we remove the job from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`. This
> means that when the job finally signals completion (e.g. in the IRQ
> handler), the scheduler won't call `sched->ops->free_job()`. As a result,
> the job and its resources won't be freed, leading to a memory leak.
>
> To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
> driver to skip the reset. This new status will indicate that the job
> should be reinserted into the pending list, and the driver will still
> signal its completion.
Since this is de facto what drivers do today I agree it makes sense to
formalise handling for it in the scheduler itself.
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Some minor comments below.
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> include/drm/gpu_scheduler.h | 2 ++
> 2 files changed, 16 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + /*
> + * If the driver indicated that the GPU is still running and wants to skip
> + * the reset, reinsert the job back into the pending list and realarm the
re-arm
> + * timeout.
> + */
> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> + spin_unlock(&sched->job_list_lock);
> + }
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> * This function is typically used for reset recovery (see the docu of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's used for reset recovery, drm_sched_stop() shouldn't be called
> + * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
s/scheduler/driver/ ?
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> {
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -389,11 +389,13 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
s/GPU/job/ ?
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_NOMINAL,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_RUNNING,
I am wondering if we could make it more obvious what the difference
between "nominal" and "running" is, and from whose point of view those
statuses should be considered.
So far we have "nominal", which means the scheduler/hardware is working fine
but the job may or may not have been cancelled. With "running" we kind
of split it into two sub-statuses and it would be nice for that to be
intuitively visible from the naming. But I struggle to suggest an
elegant name while preserving nominal as is.
Thinking out loud here - perhaps that is pointing towards an alternative:
instead of a new status, a new helper to re-insert the single job
(like drm_sched_resubmit_job(sched, job)) might fit better? Although it
would be more churn.
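For illustration only, such a helper would essentially be a thin wrapper
around what the new status already does in drm_sched_job_timedout(), e.g.:

void drm_sched_resubmit_job(struct drm_gpu_scheduler *sched,
			    struct drm_sched_job *job)
{
	spin_lock(&sched->job_list_lock);
	list_add(&job->list, &sched->pending_list);
	spin_unlock(&sched->job_list_lock);
}

with the timeout still being re-armed at the end of
drm_sched_job_timedout(), as it is today.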
Regards,
Tvrtko
> };
>
> /**
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/8] drm/sched: Always free the job after the timeout
2025-05-03 20:59 ` [PATCH 2/8] drm/sched: Always free the job after the timeout Maíra Canal
@ 2025-05-06 11:49 ` Tvrtko Ursulin
2025-05-06 12:46 ` Maíra Canal
0 siblings, 1 reply; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 11:49 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 03/05/2025 21:59, Maíra Canal wrote:
> Currently, if we add the assertions presented in this commit to the mock
> scheduler, we will see the following output:
>
> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/gpu/drm/scheduler/tests/tests_basic.c:246
> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is false
> [15:47:08] [FAILED] drm_sched_basic_timeout
> [15:47:08] # module: drm_sched_tests
>
> This occurs because `mock_sched_timedout_job()` doesn't properly handle
> the hang. From the DRM sched documentation, `drm_sched_stop()` and
> `drm_sched_start()` are typically used for reset recovery. If these
> functions are not used, the offending job won't be freed and should be
> freed by the caller.
>
> Currently, the mock scheduler doesn't use the functions provided by the
> API, nor does it handle the freeing of the job. As a result, the job isn't
> removed from the job list.
For the record, the job does get freed via the kunit-managed allocation.
It was a design choice for this test to be a *strict* unit test which
tests only a _single_ thing. And that is that the timedout_job() hook
gets called. As such the hook was implemented to satisfy that single
requirement only.
But I also do not oppose making it test multiple things in one test per se.
> This commit mocks a GPU reset by stopping the scheduler affected by the
> reset, waiting a few hundred microseconds to mimic a hardware reset, and
> then restarting the affected scheduler.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
> drivers/gpu/drm/scheduler/tests/tests_basic.c | 3 +++
> 2 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7adb8f06fc8a37969656dd3249fa7..e9af202d84bd55ea5cc048215e39f5407bc84458 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -1,6 +1,8 @@
> // SPDX-License-Identifier: GPL-2.0
> /* Copyright (c) 2025 Valve Corporation */
>
> +#include <linux/delay.h>
> +
> #include "sched_tests.h"
>
> /*
> @@ -203,10 +205,18 @@ static struct dma_fence *mock_sched_run_job(struct drm_sched_job *sched_job)
> static enum drm_gpu_sched_stat
> mock_sched_timedout_job(struct drm_sched_job *sched_job)
> {
> + struct drm_mock_scheduler *sched =
> + drm_sched_to_mock_sched(sched_job->sched);
> struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
>
> job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;
>
> + drm_sched_stop(&sched->base, &job->base);
> +
> + usleep_range(200, 500);
msleep(10) or something to make it seem less like the actual numbers are
relevant?
> +
> + drm_sched_start(&sched->base, 0);
> +
> return DRM_GPU_SCHED_STAT_NOMINAL;
> }
>
> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> index 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> @@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit *test)
> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
> DRM_MOCK_SCHED_JOB_TIMEDOUT);
>
> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
Hmm I think this assert could be racy because it appears to rely on the
free worker to run and cleanup the "finished" job in the window between
drm_mock_sched_job_wait_finished() (or drm_sched_start(), depends how
you look at it) and here. Am I missing something?
Regards,
Tvrtko
> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
> +
> drm_mock_sched_entity_free(entity);
> }
>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests
2025-05-03 20:59 ` [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests Maíra Canal
@ 2025-05-06 12:03 ` Tvrtko Ursulin
2025-05-06 12:56 ` Maíra Canal
0 siblings, 1 reply; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 12:03 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 03/05/2025 21:59, Maíra Canal wrote:
> As more KUnit tests are introduced to evaluate the basic capabilities of
> the `timedout_job()` hook, the test suite will continue to increase in
> duration. To reduce the overall running time of the test suite, decrease
> the scheduler's timeout for the timeout tests.
>
> Before this commit:
>
> [15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s building, 5.229s running
>
> After this commit:
>
> [15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s building, 4.037s running
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/tests/tests_basic.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> index 8f960f0fd31d0af7873f410ceba2d636f58a5474..00c691cb3c306f609684f554f17fcb54ba74cb95 100644
> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> @@ -5,6 +5,8 @@
>
> #include "sched_tests.h"
>
> +#define MOCK_TIMEOUT (HZ / 5)
> +
> /*
> * DRM scheduler basic tests should check the basic functional correctness of
> * the scheduler, including some very light smoke testing. More targeted tests,
> @@ -28,7 +30,7 @@ static void drm_sched_basic_exit(struct kunit *test)
>
> static int drm_sched_timeout_init(struct kunit *test)
> {
> - test->priv = drm_mock_sched_new(test, HZ);
> + test->priv = drm_mock_sched_new(test, MOCK_TIMEOUT);
>
> return 0;
> }
> @@ -224,17 +226,17 @@ static void drm_sched_basic_timeout(struct kunit *test)
>
> drm_mock_sched_job_submit(job);
>
> - done = drm_mock_sched_job_wait_scheduled(job, HZ);
> + done = drm_mock_sched_job_wait_scheduled(job, MOCK_TIMEOUT);
This wait is accounting for the fact sched->wq needs to run and call
->run_job() before job will become scheduled. It is not related to
timeout handling. I was going for a safe value and I think decreasing it
will not speed up the test but may cause sporadic failures.
> KUNIT_ASSERT_TRUE(test, done);
>
> - done = drm_mock_sched_job_wait_finished(job, HZ / 2);
> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT / 2);
> KUNIT_ASSERT_FALSE(test, done);
>
> KUNIT_ASSERT_EQ(test,
> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
> 0);
>
> - done = drm_mock_sched_job_wait_finished(job, HZ);
> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
> KUNIT_ASSERT_FALSE(test, done);
Above two are related to timeout handling and should be safe to change.
With HZ / 5 first assert could have a false negative if timeout work
would run, but later than 100ms (HZ / 5 / 2). And the second a false
negative if it fails to run in 300ms (HZ / 5 / 2 + HZ / 5). Neither
failure sounds likely in the kunit environment so, again, I think those
two are okay to speed up.
Regards,
Tvrtko
>
> KUNIT_ASSERT_EQ(test,
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/8] drm/sched: Always free the job after the timeout
2025-05-06 11:49 ` Tvrtko Ursulin
@ 2025-05-06 12:46 ` Maíra Canal
2025-05-06 13:28 ` Tvrtko Ursulin
0 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-06 12:46 UTC (permalink / raw)
To: Tvrtko Ursulin, Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Tvrtko,
Thanks for your review!
On 06/05/25 08:49, Tvrtko Ursulin wrote:
>
> On 03/05/2025 21:59, Maíra Canal wrote:
>> Currently, if we add the assertions presented in this commit to the mock
>> scheduler, we will see the following output:
>>
>> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/gpu/
>> drm/scheduler/tests/tests_basic.c:246
>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is false
>> [15:47:08] [FAILED] drm_sched_basic_timeout
>> [15:47:08] # module: drm_sched_tests
>>
>> This occurs because `mock_sched_timedout_job()` doesn't properly handle
>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>> `drm_sched_start()` are typically used for reset recovery. If these
>> functions are not used, the offending job won't be freed and should be
>> freed by the caller.
>>
>> Currently, the mock scheduler doesn't use the functions provided by the
>> API, nor does it handle the freeing of the job. As a result, the job
>> isn't
>> removed from the job list.
>
> For the record, the job does get freed via the kunit-managed allocation.
Sorry, I didn't express myself correctly. Indeed, it is. I meant that
the DRM scheduler didn't free the job.
>
> It was a design choice for this test to be a *strict* unit test which
> tests only a _single_ thing. And that is that the timedout_job() hook
> gets called. As such the hook was implemented to satisfy that single
> requirement only.
>
What do you think about checking that `sched->job_list` won't be empty?
I wanted to add such an assertion to make sure that the behavior of the
timeout won't change in the future (e.g. a patch makes a change that calls
`free_job()` for the guilty job at timeout). Does it make sense to you?
> But I also do not oppose making it test multiple things in one test per se.
>
>> This commit mocks a GPU reset by stopping the scheduler affected by the
>> reset, waiting a few hundred microseconds to mimic a hardware reset, and
>> then restarting the affected scheduler.
>>
>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>> ---
>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 3 +++
>> 2 files changed, 13 insertions(+)
>>
[...]
>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/
>> gpu/drm/scheduler/tests/tests_basic.c
>> index
>> 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>> @@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit
>> *test)
>> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>> DRM_MOCK_SCHED_JOB_TIMEDOUT);
>> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
>
> Hmm I think this assert could be racy because it appears to rely on the
> free worker to run and cleanup the "finished" job in the window between
> drm_mock_sched_job_wait_finished() (or drm_sched_start(), depends how
> you look at it) and here. Am I missing something?
From what I understand, the job is freed by the timeout worker [1] after
`drm_sched_stop()` marked the job as guilty.
Therefore, if the timeout was called (and we asserted that through
`job->flags`), we can be sure that the job was freed.
[1]
https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/drivers/gpu/drm/scheduler/sched_main.c#L568
Best Regards,
- Maíra
>
> Regards,
>
> Tvrtko
>
>> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
>> +
>> drm_mock_sched_entity_free(entity);
>> }
>>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests
2025-05-06 12:03 ` Tvrtko Ursulin
@ 2025-05-06 12:56 ` Maíra Canal
2025-05-06 13:20 ` Tvrtko Ursulin
0 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-06 12:56 UTC (permalink / raw)
To: Tvrtko Ursulin, Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Tvrtko,
Thanks for your review!
On 06/05/25 09:03, Tvrtko Ursulin wrote:
>
> On 03/05/2025 21:59, Maíra Canal wrote:
>> As more KUnit tests are introduced to evaluate the basic capabilities of
>> the `timedout_job()` hook, the test suite will continue to increase in
>> duration. To reduce the overall running time of the test suite, decrease
>> the scheduler's timeout for the timeout tests.
>>
>> Before this commit:
>>
>> [15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s
>> building, 5.229s running
>>
>> After this commit:
>>
>> [15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s
>> building, 4.037s running
>>
>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>> ---
>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 10 ++++++----
>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/
>> gpu/drm/scheduler/tests/tests_basic.c
>> index
>> 8f960f0fd31d0af7873f410ceba2d636f58a5474..00c691cb3c306f609684f554f17fcb54ba74cb95 100644
>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>> @@ -5,6 +5,8 @@
>> #include "sched_tests.h"
>> +#define MOCK_TIMEOUT (HZ / 5)
>> +
>> /*
>> * DRM scheduler basic tests should check the basic functional
>> correctness of
>> * the scheduler, including some very light smoke testing. More
>> targeted tests,
>> @@ -28,7 +30,7 @@ static void drm_sched_basic_exit(struct kunit *test)
>> static int drm_sched_timeout_init(struct kunit *test)
>> {
>> - test->priv = drm_mock_sched_new(test, HZ);
>> + test->priv = drm_mock_sched_new(test, MOCK_TIMEOUT);
>> return 0;
>> }
>> @@ -224,17 +226,17 @@ static void drm_sched_basic_timeout(struct kunit
>> *test)
>> drm_mock_sched_job_submit(job);
>> - done = drm_mock_sched_job_wait_scheduled(job, HZ);
>> + done = drm_mock_sched_job_wait_scheduled(job, MOCK_TIMEOUT);
>
>> This wait is accounting for the fact sched->wq needs to run and call
>> ->run_job() before job will become scheduled. It is not related to
> timeout handling. I was going for a safe value and I think decreasing it
> will not speed up the test but may cause sporadic failures.
I'll address it in v2.
>
>> KUNIT_ASSERT_TRUE(test, done);
>> - done = drm_mock_sched_job_wait_finished(job, HZ / 2);
>> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT / 2);
>> KUNIT_ASSERT_FALSE(test, done);
>> KUNIT_ASSERT_EQ(test,
>> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>> 0);
>> - done = drm_mock_sched_job_wait_finished(job, HZ);
>> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
>> KUNIT_ASSERT_FALSE(test, done);
>
> Above two are related to timeout handling and should be safe to change.
>
> With HZ / 5 first assert could have a false negative if timeout work
> would run, but later than 100ms (HZ / 5 / 2). And the second a false
> negative if it fails to run in 300ms (HZ / 5 / 2 + HZ / 5). Neither
> failure sounds likely in the kunit environment so, again, I think those
> two are okay to speed up.
What do you think about using a slightly bigger timeout? Maybe HZ / 4 or
HZ / 2.
Best Regards,
- Maíra
>
> Regards,
>
> Tvrtko
>
>> KUNIT_ASSERT_EQ(test,
>>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests
2025-05-06 12:56 ` Maíra Canal
@ 2025-05-06 13:20 ` Tvrtko Ursulin
0 siblings, 0 replies; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 13:20 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 06/05/2025 13:56, Maíra Canal wrote:
> Hi Tvrtko,
>
> Thanks for your review!
>
> On 06/05/25 09:03, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> As more KUnit tests are introduced to evaluate the basic capabilities of
>>> the `timedout_job()` hook, the test suite will continue to increase in
>>> duration. To reduce the overall running time of the test suite, decrease
>>> the scheduler's timeout for the timeout tests.
>>>
>>> Before this commit:
>>>
>>> [15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s
>>> building, 5.229s running
>>>
>>> After this commit:
>>>
>>> [15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s
>>> building, 4.037s running
>>>
>>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>>> ---
>>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 10 ++++++----
>>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/
>>> gpu/drm/scheduler/tests/tests_basic.c
>>> index
>>> 8f960f0fd31d0af7873f410ceba2d636f58a5474..00c691cb3c306f609684f554f17fcb54ba74cb95 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> @@ -5,6 +5,8 @@
>>> #include "sched_tests.h"
>>> +#define MOCK_TIMEOUT (HZ / 5)
>>> +
>>> /*
>>> * DRM scheduler basic tests should check the basic functional
>>> correctness of
>>> * the scheduler, including some very light smoke testing. More
>>> targeted tests,
>>> @@ -28,7 +30,7 @@ static void drm_sched_basic_exit(struct kunit *test)
>>> static int drm_sched_timeout_init(struct kunit *test)
>>> {
>>> - test->priv = drm_mock_sched_new(test, HZ);
>>> + test->priv = drm_mock_sched_new(test, MOCK_TIMEOUT);
>>> return 0;
>>> }
>>> @@ -224,17 +226,17 @@ static void drm_sched_basic_timeout(struct
>>> kunit *test)
>>> drm_mock_sched_job_submit(job);
>>> - done = drm_mock_sched_job_wait_scheduled(job, HZ);
>>> + done = drm_mock_sched_job_wait_scheduled(job, MOCK_TIMEOUT);
>>
>> This wait is accounting for the fact sched->wq needs to run and call
>> ->run_job() before job will become scheduled. It is not related to
>> timeout handling. I was going for a safe value and I think decreasing
>> it will not speed up the test but may cause sporadic failures.
>
> I'll address it in v2.
>
>>
>>> KUNIT_ASSERT_TRUE(test, done);
>>> - done = drm_mock_sched_job_wait_finished(job, HZ / 2);
>>> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT / 2);
>>> KUNIT_ASSERT_FALSE(test, done);
>>> KUNIT_ASSERT_EQ(test,
>>> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>>> 0);
>>> - done = drm_mock_sched_job_wait_finished(job, HZ);
>>> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
>>> KUNIT_ASSERT_FALSE(test, done);
>>
>> Above two are related to timeout handling and should be safe to change.
>>
>> With HZ / 5 first assert could have a false negative if timeout work
>> would run, but later than 100ms (HZ / 5 / 2). And the second a false
>> negative if it fails to run in 300ms (HZ / 5 / 2 + HZ / 5). Neither
>> failure sounds likely in the kunit environment so, again, I think
>> those two are okay to speed up.
>
> What do you think about using a slightly bigger timeout? Maybe HZ / 4 or
> HZ / 2.
I thought HZ / 5 would be safe so I'd leave it and only tweak if it
turns out to be a problem.
Regards,
Tvrtko
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/8] drm/sched: Always free the job after the timeout
2025-05-06 12:46 ` Maíra Canal
@ 2025-05-06 13:28 ` Tvrtko Ursulin
2025-05-06 13:38 ` Maíra Canal
0 siblings, 1 reply; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 13:28 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 06/05/2025 13:46, Maíra Canal wrote:
> Hi Tvrtko,
>
> Thanks for your review!
>
> On 06/05/25 08:49, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> Currently, if we add the assertions presented in this commit to the mock
>>> scheduler, we will see the following output:
>>>
>>> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
>>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/
>>> gpu/ drm/scheduler/tests/tests_basic.c:246
>>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is
>>> false
>>> [15:47:08] [FAILED] drm_sched_basic_timeout
>>> [15:47:08] # module: drm_sched_tests
>>>
>>> This occurs because `mock_sched_timedout_job()` doesn't properly handle
>>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>>> `drm_sched_start()` are typically used for reset recovery. If these
>>> functions are not used, the offending job won't be freed and should be
>>> freed by the caller.
>>>
>>> Currently, the mock scheduler doesn't use the functions provided by the
>>> API, nor does it handle the freeing of the job. As a result, the job
>>> isn't
>>> removed from the job list.
>>
>> For the record the job does gets freed via the kunit managed allocation.
>
> Sorry, I didn't express myself correctly. Indeed, it is. I meant that
> the DRM scheduler didn't free the job.
>
>>
>> It was a design choice for this test to be a *strict* unit test which
>> tests only a _single_ thing. And that is that the timedout_job() hook
>> gets called. As such the hook was implemented to satisfy that single
>> requirement only.
>>
>
> What do you think about checking that `sched->job_list` won't be empty?
>
> I wanted to add such assertion to make sure that the behavior of the
> timeout won't change in future (e.g. a patch makes a change that calls
> `free_job()` for the guilty job at timeout). Does it make sense to you?
Where would that assert be?
>> But I also do not oppose making it test multiple things in one test
>> per se.
>>
>>> This commit mocks a GPU reset by stopping the scheduler affected by the
>>> reset, waiting a couple of microseconds to mimic a hardware reset, and
>>> then restart the affected scheduler.
>>>
>>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>>> ---
>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
>>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 3 +++
>>> 2 files changed, 13 insertions(+)
>>>
>
> [...]
>
>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/
>>> gpu/drm/scheduler/tests/tests_basic.c
>>> index
>>> 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> @@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit
>>> *test)
>>> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>>> DRM_MOCK_SCHED_JOB_TIMEDOUT);
>>> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
>>
>> Hmm I think this assert could be racy because it appears to rely on
>> the free worker to run and cleanup the "finished" job in the window
>> between drm_mock_sched_job_wait_finished() (or drm_sched_start(),
>> depends how you look at it) and here. Am I missing something?
>
> From what I understand, the job is freed by the timeout worker [1] after
> `drm_sched_stop()` marked the job as guilty.
>
> Therefore, if the timeout was called (and we asserted that through
> `job->flags`), we can be sure that the job was freed.
>
> [1] https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/
> drivers/gpu/drm/scheduler/sched_main.c#L568
Hm I thought it would end up on the dma_fence_remove_callback() == true
branch in drm_sched_stop().
I gave it a quick spin locally and that indeed appears to be the case.
So AFAICT it does rely on the free worker to have executed before
the assert.
Regards,
Tvrtko
>
> Best Regards,
> - Maíra
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
>>> +
>>> drm_mock_sched_entity_free(entity);
>>> }
>>>
>>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/8] drm/sched: Always free the job after the timeout
2025-05-06 13:28 ` Tvrtko Ursulin
@ 2025-05-06 13:38 ` Maíra Canal
2025-05-06 14:18 ` Tvrtko Ursulin
0 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-06 13:38 UTC (permalink / raw)
To: Tvrtko Ursulin, Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Tvrtko,
On 06/05/25 10:28, Tvrtko Ursulin wrote:
>
> On 06/05/2025 13:46, Maíra Canal wrote:
>> Hi Tvrtko,
>>
>> Thanks for your review!
>>
>> On 06/05/25 08:49, Tvrtko Ursulin wrote:
>>>
>>> On 03/05/2025 21:59, Maíra Canal wrote:
>>>> Currently, if we add the assertions presented in this commit to the
>>>> mock
>>>> scheduler, we will see the following output:
>>>>
>>>> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
>>>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>>>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/
>>>> gpu/ drm/scheduler/tests/tests_basic.c:246
>>>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is
>>>> false
>>>> [15:47:08] [FAILED] drm_sched_basic_timeout
>>>> [15:47:08] # module: drm_sched_tests
>>>>
>>>> This occurs because `mock_sched_timedout_job()` doesn't properly handle
>>>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>>>> `drm_sched_start()` are typically used for reset recovery. If these
>>>> functions are not used, the offending job won't be freed and should be
>>>> freed by the caller.
>>>>
>>>> Currently, the mock scheduler doesn't use the functions provided by the
>>>> API, nor does it handle the freeing of the job. As a result, the job
>>>> isn't
>>>> removed from the job list.
>>>
>>> For the record the job does gets freed via the kunit managed allocation.
>>
>> Sorry, I didn't express myself correctly. Indeed, it is. I meant that
>> the DRM scheduler didn't free the job.
>>
>>>
>>> It was a design choice for this test to be a *strict* unit test which
>>> tests only a _single_ thing. And that is that the timedout_job() hook
>>> gets called. As such the hook was implemented to satisfy that single
>>> requirement only.
>>>
>>
>> What do you think about checking that `sched->job_list` won't be empty?
>>
>> I wanted to add such assertion to make sure that the behavior of the
>> timeout won't change in future (e.g. a patch makes a change that calls
>> `free_job()` for the guilty job at timeout). Does it make sense to you?
>
> Where would that assert be?
>
I believe it would be in the same place as this patch's assertions, but
instead of `KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));`, it
would be `KUNIT_ASSERT_FALSE(test, list_empty(&sched->job_list));`.
But I don't feel strongly about it. I can drop the patch if you believe
it's a better option.
Best Regards,
- Maíra
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
2025-05-03 20:59 ` [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
@ 2025-05-06 13:58 ` Tvrtko Ursulin
0 siblings, 0 replies; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 13:58 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 03/05/2025 21:59, Maíra Canal wrote:
> Add a test to submit a single job against a scheduler with the timeout
> configured and verify that if the job is still running, the timeout
> handler will skip the reset and allow the job to complete.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 3 ++
> drivers/gpu/drm/scheduler/tests/tests_basic.c | 44 ++++++++++++++++++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index e9af202d84bd55ea5cc048215e39f5407bc84458..9d594cb5bf567be25e018ddbcd28b70a7e994260 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -211,6 +211,9 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
>
> job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;
>
> + if (job->finish_at && ktime_before(ktime_get(), job->finish_at))
> + return DRM_GPU_SCHED_STAT_RUNNING;
Hmm this would prevent testing timeout handling with mock jobs that have a
duration set for any future test. It works, but I am wondering whether a
more explicit way wouldn't be better. For example to add a new job flag
like DRM_MOCK_SCHED_JOB_DONTRESET or similar?
That way the test could use the timing-insensitive drm_mock_sched_advance()
and have explicit control of the execution. If you don't mind I think I
would prefer that.
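For illustration, the flag-based variant could look roughly like this in
mock_sched_timedout_job(), with DRM_MOCK_SCHED_JOB_DONTRESET being the
hypothetical new flag rather than part of the current mock API:

	job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;

	/* Hypothetical flag, set by the test to opt out of the reset path. */
	if (job->flags & DRM_MOCK_SCHED_JOB_DONTRESET)
		return DRM_GPU_SCHED_STAT_RUNNING;

A test would then set the flag before submitting and complete the job
deterministically with drm_mock_sched_advance() instead of relying on a
duration.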
Regards,
Tvrtko
> +
> drm_sched_stop(&sched->base, &job->base);
>
> usleep_range(200, 500);
> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> index 00c691cb3c306f609684f554f17fcb54ba74cb95..669a211b216ee298544ac237abb866077d856586 100644
> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> @@ -249,8 +249,52 @@ static void drm_sched_basic_timeout(struct kunit *test)
> drm_mock_sched_entity_free(entity);
> }
>
> +static void drm_sched_skip_reset(struct kunit *test)
> +{
> + struct drm_mock_scheduler *sched = test->priv;
> + struct drm_mock_sched_entity *entity;
> + struct drm_mock_sched_job *job;
> + bool done;
> +
> + /*
> + * Submit a single job against a scheduler with the timeout configured
> + * and verify that if the job is still running, the timeout handler
> + * will skip the reset and allow the job to complete.
> + */
> +
> + entity = drm_mock_sched_entity_new(test,
> + DRM_SCHED_PRIORITY_NORMAL,
> + sched);
> + job = drm_mock_sched_job_new(test, entity);
> +
> + drm_mock_sched_job_set_duration_us(job, jiffies_to_usecs(2 * MOCK_TIMEOUT));
> + drm_mock_sched_job_submit(job);
> +
> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
> + KUNIT_ASSERT_FALSE(test, done);
> +
> + KUNIT_ASSERT_EQ(test,
> + job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
> + DRM_MOCK_SCHED_JOB_TIMEDOUT);
> +
> + KUNIT_ASSERT_FALSE(test, list_empty(&sched->job_list));
> +
> + done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
> + KUNIT_ASSERT_TRUE(test, done);
> +
> + KUNIT_ASSERT_EQ(test,
> + job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
> + DRM_MOCK_SCHED_JOB_TIMEDOUT);
> +
> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
> +
> + drm_mock_sched_entity_free(entity);
> +}
> +
> static struct kunit_case drm_sched_timeout_tests[] = {
> KUNIT_CASE(drm_sched_basic_timeout),
> + KUNIT_CASE(drm_sched_skip_reset),
> {}
> };
>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/8] drm/sched: Always free the job after the timeout
2025-05-06 13:38 ` Maíra Canal
@ 2025-05-06 14:18 ` Tvrtko Ursulin
0 siblings, 0 replies; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-06 14:18 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 06/05/2025 14:38, Maíra Canal wrote:
> Hi Tvrtko,
>
> On 06/05/25 10:28, Tvrtko Ursulin wrote:
>>
>> On 06/05/2025 13:46, Maíra Canal wrote:
>>> Hi Tvrtko,
>>>
>>> Thanks for your review!
>>>
>>> On 06/05/25 08:49, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/05/2025 21:59, Maíra Canal wrote:
>>>>> Currently, if we add the assertions presented in this commit to the
>>>>> mock
>>>>> scheduler, we will see the following output:
>>>>>
>>>>> [15:47:08] ============== [PASSED] drm_sched_basic_tests
>>>>> ==============
>>>>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest)
>>>>> =========
>>>>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/
>>>>> gpu/ drm/scheduler/tests/tests_basic.c:246
>>>>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is
>>>>> false
>>>>> [15:47:08] [FAILED] drm_sched_basic_timeout
>>>>> [15:47:08] # module: drm_sched_tests
>>>>>
>>>>> This occurs because `mock_sched_timedout_job()` doesn't properly
>>>>> handle
>>>>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>>>>> `drm_sched_start()` are typically used for reset recovery. If these
>>>>> functions are not used, the offending job won't be freed and should be
>>>>> freed by the caller.
>>>>>
>>>>> Currently, the mock scheduler doesn't use the functions provided by
>>>>> the
>>>>> API, nor does it handle the freeing of the job. As a result, the
>>>>> job isn't
>>>>> removed from the job list.
>>>>
>>>> For the record the job does gets freed via the kunit managed
>>>> allocation.
>>>
>>> Sorry, I didn't express myself correctly. Indeed, it is. I meant that
>>> the DRM scheduler didn't free the job.
>>>
>>>>
>>>> It was a design choice for this test to be a *strict* unit test
>>>> which tests only a _single_ thing. And that is that the
>>>> timedout_job() hook gets called. As such the hook was implemented to
>>>> satisfy that single requirement only.
>>>>
>>>
>>> What do you think about checking that `sched->job_list` won't be empty?
>>>
>>> I wanted to add such assertion to make sure that the behavior of the
>>> timeout won't change in future (e.g. a patch makes a change that calls
>>> `free_job()` for the guilty job at timeout). Does it make sense to you?
>>
>> Where would that assert be?
>>
>
> I believe it would be in the same place as this patch assertions, but
> instead of `KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));`, it
> would be `KUNIT_ASSERT_FALSE(test, list_empty(&sched->job_list));`.
>
> But I don't feel strongly about it. I can drop the patch if you believe
> it's a better option.
I don't mind this patch, I was just explaining how the current test was
deliberately testing a single thing. But we can change it to test more,
that's fine.
In this case it would go from testing that the timeout callback fires to
testing the callback fires and something else happens.
The key is to define the "something else" part so it is not timing sensitive.
The drm_sched_stop+delay+drm_sched_start approach would perhaps work if
you signalled the job with an errno set before drm_sched_stop?
Then it would hit the "free_guilty" path in drm_sched_stop and wouldn't
be timing sensitive. As long as there aren't any locking considerations
I am missing. That could then safely have the two list_empty asserts.
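A rough sketch of that idea inside mock_sched_timedout_job(), assuming the
mock job exposes its hardware fence as job->hw_fence (field name assumed
here) and ignoring locking details:

	/*
	 * Error out and signal the guilty job before stopping the
	 * scheduler, so drm_sched_stop() takes the free_guilty path
	 * instead of relying on the free worker having already run.
	 */
	dma_fence_set_error(&job->hw_fence, -ETIMEDOUT);
	dma_fence_signal(&job->hw_fence);

	drm_sched_stop(&sched->base, &job->base);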
Regards,
Tvrtko
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-06 2:41 ` Matthew Brost
@ 2025-05-06 14:32 ` Matthew Brost
2025-05-12 11:13 ` Philipp Stanner
0 siblings, 1 reply; 34+ messages in thread
From: Matthew Brost @ 2025-05-06 14:32 UTC (permalink / raw)
To: Maíra Canal
Cc: Danilo Krummrich, Philipp Stanner, Christian König,
Tvrtko Ursulin, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, kernel-dev, dri-devel, etnaviv, intel-xe
On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote:
> On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
> > When the DRM scheduler times out, it's possible that the GPU isn't hung;
> > instead, a job may still be running, and there may be no valid reason to
> > reset the hardware. This can occur in two situations:
> >
> > 1. The GPU exposes some mechanism that ensures the GPU is still making
> > progress. By checking this mechanism, we can safely skip the reset,
> > rearm the timeout, and allow the job to continue running until
> > completion. This is the case for v3d and Etnaviv.
> > 2. TDR has fired before the IRQ that signals the fence. Consequently,
> > the job actually finishes, but it triggers a timeout before signaling
> > the completion fence.
> >
>
> We have both of these cases in Xe too. We implement the requeuing in Xe
> via driver side function - xe_sched_add_pending_job but this looks
> better and will make use of this.
>
> > These two scenarios are problematic because we remove the job from the
> > `sched->pending_list` before calling `sched->ops->timedout_job()`. This
> > means that when the job finally signals completion (e.g. in the IRQ
> > handler), the scheduler won't call `sched->ops->free_job()`. As a result,
> > the job and its resources won't be freed, leading to a memory leak.
> >
> > To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
> > driver to skip the reset. This new status will indicate that the job
> > should be reinserted into the pending list, and the driver will still
> > signal its completion.
> >
> > Signed-off-by: Maíra Canal <mcanal@igalia.com>
>
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>
Wait - never mind, I think one issue is below.
> > ---
> > drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> > include/drm/gpu_scheduler.h | 2 ++
> > 2 files changed, 16 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > job->sched->ops->free_job(job);
> > sched->free_guilty = false;
> > }
> > +
> > + /*
> > + * If the driver indicated that the GPU is still running and wants to skip
> > + * the reset, reinsert the job back into the pending list and realarm the
> > + * timeout.
> > + */
> > + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> > + spin_lock(&sched->job_list_lock);
> > + list_add(&job->list, &sched->pending_list);
> > + spin_unlock(&sched->job_list_lock);
> > + }
I think you need to requeue the free_job wq here. It is possible that the
free_job wq ran, didn't find a job and went to sleep, and then we add a
signaled job here which will never get freed.
Matt
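A minimal sketch of that requeue, assuming drm_sched_run_free_queue()
(mentioned elsewhere in this thread) is the right helper to kick here:

	if (status == DRM_GPU_SCHED_STAT_RUNNING) {
		spin_lock(&sched->job_list_lock);
		list_add(&job->list, &sched->pending_list);
		spin_unlock(&sched->job_list_lock);

		/*
		 * Kick the free worker in case it already ran, found the
		 * pending list empty and went back to sleep before the
		 * (possibly already signaled) job was re-added.
		 */
		drm_sched_run_free_queue(sched);
	}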
> > } else {
> > spin_unlock(&sched->job_list_lock);
> > }
> > @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > * This function is typically used for reset recovery (see the docu of
> > * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> > * scheduler teardown, i.e., before calling drm_sched_fini().
> > + *
> > + * As it's used for reset recovery, drm_sched_stop() shouldn't be called
> > + * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
> > */
> > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > {
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -389,11 +389,13 @@ struct drm_sched_job {
> > * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> > * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> > + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
> > */
> > enum drm_gpu_sched_stat {
> > DRM_GPU_SCHED_STAT_NONE,
> > DRM_GPU_SCHED_STAT_NOMINAL,
> > DRM_GPU_SCHED_STAT_ENODEV,
> > + DRM_GPU_SCHED_STAT_RUNNING,
> > };
> >
> > /**
> >
> > --
> > 2.49.0
> >
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-06 11:32 ` Tvrtko Ursulin
@ 2025-05-07 12:33 ` Maíra Canal
2025-05-07 12:50 ` Tvrtko Ursulin
0 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-07 12:33 UTC (permalink / raw)
To: Tvrtko Ursulin, Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Tvrtko,
Thanks for the review!
On 06/05/25 08:32, Tvrtko Ursulin wrote:
>
> On 03/05/2025 21:59, Maíra Canal wrote:
>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>> instead, a job may still be running, and there may be no valid reason to
>> reset the hardware. This can occur in two situations:
>>
>> 1. The GPU exposes some mechanism that ensures the GPU is still making
>> progress. By checking this mechanism, we can safely skip the reset,
>> rearm the timeout, and allow the job to continue running until
>> completion. This is the case for v3d and Etnaviv.
>> 2. TDR has fired before the IRQ that signals the fence. Consequently,
>> the job actually finishes, but it triggers a timeout before
>> signaling
>> the completion fence.
>>
>> These two scenarios are problematic because we remove the job from the
>> `sched->pending_list` before calling `sched->ops->timedout_job()`. This
>> means that when the job finally signals completion (e.g. in the IRQ
>> handler), the scheduler won't call `sched->ops->free_job()`. As a result,
>> the job and its resources won't be freed, leading to a memory leak.
>>
>> To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
>> driver to skip the reset. This new status will indicate that the job
>> should be reinserted into the pending list, and the driver will still
>> signal its completion.
>
[...]
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index
>> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -389,11 +389,13 @@ struct drm_sched_job {
>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
>
> s/GPU/job/ ?
>
>> */
>> enum drm_gpu_sched_stat {
>> DRM_GPU_SCHED_STAT_NONE,
>> DRM_GPU_SCHED_STAT_NOMINAL,
>> DRM_GPU_SCHED_STAT_ENODEV,
>> + DRM_GPU_SCHED_STAT_RUNNING,
>
> I am wondering if we could make it more obvious what is the difference
> between "nominal" and "running" and from whose point of view should
> those statuses be considered.
> So far we have "nominal" which means scheduler/hardware is working fine
> but the job may or may have not been cancelled. With "running" we kind
> of split it into two sub-statuses and it would be nice for that to be
> intuitively visible from the naming. But I struggle to suggest an
> elegant name while preserving nominal as is.
I was thinking: how about changing DRM_GPU_SCHED_STAT_NOMINAL to
DRM_GPU_SCHED_STAT_RESET (the hardware is fine, but we reset it)?
Then, when we skip the reset, we would have DRM_GPU_SCHED_STAT_NOMINAL
(which means the hardware is fine and we didn't reset it).
I'm open to other suggestions.
>
> Thinking out loud here - perhaps that is pointing towards an alternative
> that instead of a new status, a new helper to re-insert the single job
> (like drm_sched_resubmit_job(sched, job)) would fit better? Although it
> would be more churn.
>
Although your solution might be more elegant, I'm worried that such a
function could be used improperly by new users (e.g. being called in
contexts other than `timedout_job()`).
I'd prefer to have a new status as it'll be used solely for
`timedout_job()` (making it harder for users to use it inappropriately).
With the addition of Matthew's feedback (calling
`drm_sched_run_free_queue()` after adding the job to the pending list),
I think it makes even more sense to keep it inside the timeout function.
I hope others can chime in and give their opinions about your idea.
Best Regards,
- Maíra
> Regards,
>
> Tvrtko
>
>> };
>> /**
>>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-07 12:33 ` Maíra Canal
@ 2025-05-07 12:50 ` Tvrtko Ursulin
2025-05-12 9:24 ` Philipp Stanner
0 siblings, 1 reply; 34+ messages in thread
From: Tvrtko Ursulin @ 2025-05-07 12:50 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 07/05/2025 13:33, Maíra Canal wrote:
> Hi Tvrtko,
>
> Thanks for the review!
>
> On 06/05/25 08:32, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>> instead, a job may still be running, and there may be no valid reason to
>>> reset the hardware. This can occur in two situations:
>>>
>>> 1. The GPU exposes some mechanism that ensures the GPU is still
>>> making
>>> progress. By checking this mechanism, we can safely skip the
>>> reset,
>>> rearm the timeout, and allow the job to continue running until
>>> completion. This is the case for v3d and Etnaviv.
>>> 2. TDR has fired before the IRQ that signals the fence. Consequently,
>>> the job actually finishes, but it triggers a timeout before
>>> signaling
>>> the completion fence.
>>>
>>> These two scenarios are problematic because we remove the job from the
>>> `sched->pending_list` before calling `sched->ops->timedout_job()`. This
>>> means that when the job finally signals completion (e.g. in the IRQ
>>> handler), the scheduler won't call `sched->ops->free_job()`. As a
>>> result,
>>> the job and its resources won't be freed, leading to a memory leak.
>>>
>>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>>> allows a
>>> driver to skip the reset. This new status will indicate that the job
>>> should be reinserted into the pending list, and the driver will still
>>> signal its completion.
>>
>
> [...]
>
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index
>>> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -389,11 +389,13 @@ struct drm_sched_job {
>>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>>> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
>>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>>> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the
>>> reset.
>>
>> s/GPU/job/ ?
>>
>>> */
>>> enum drm_gpu_sched_stat {
>>> DRM_GPU_SCHED_STAT_NONE,
>>> DRM_GPU_SCHED_STAT_NOMINAL,
>>> DRM_GPU_SCHED_STAT_ENODEV,
>>> + DRM_GPU_SCHED_STAT_RUNNING,
>>
>> I am wondering if we could make it more obvious what is the difference
>> between "nominal" and "running" and from whose point of view should
>> those statuses be considered.
>> So far we have "nominal" which means scheduler/hardware is working
>> fine
>> but the job may or may have not been cancelled. With "running" we kind
>> of split it into two sub-statuses and it would be nice for that to be
>> intuitively visible from the naming. But I struggle to suggest an
>> elegant name while preserving nominal as is.
>
> I was thinking: how about changing DRM_GPU_SCHED_STAT_NOMINAL to
> DRM_GPU_SCHED_STAT_RESET (the hardware is fine, but we reset it)?
>
> Then, when we skip the reset, we would have DRM_GPU_SCHED_STAT_NOMINAL
> (which means the hardware is fine and we didn't reset it).
>
> I'm open to other suggestions.
DRM_GPU_SCHED_STAT_RESET sounds like a good name and seems to paint a
consistent story between running - reset - enodev.
>> Thinking out loud here - perhaps that is pointing towards an
>> alternative that instead of a new status, a new helper to re-insert
>> the single job (like drm_sched_resubmit_job(sched, job)) would fit
>> better? Although it would be more churn.
>>
>
> Although your solution might be more elegant, I'm worried that such a
> function could be used improperly by new users (e.g. being called in
> contexts other than `timedout_job()`).
We could call it drm_sched_untimedout_job(). </humour>
> I'd prefer to have a new status as it'll be use solely for
> `timedout_job()` (making it harder for users to use it inappropriately).
> With the addition of Matthew's feedback (calling
> `drm_sched_run_free_queue()` after adding the job to the pending list),
> I think it makes even more sense to keep it inside the timeout function.
>
> I hope others can chime in and give their opinions about your idea.
Yeah - Philipp, Danilo - what do you prefer? A third enum value or a new
helper?
Regards,
Tvrtko
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 8/8] drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
2025-05-03 20:59 ` [PATCH 8/8] drm/panfrost: " Maíra Canal
@ 2025-05-08 15:04 ` Steven Price
0 siblings, 0 replies; 34+ messages in thread
From: Steven Price @ 2025-05-08 15:04 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Tvrtko Ursulin,
Simona Vetter, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 03/05/2025 21:59, Maíra Canal wrote:
> Panfrost can skip the reset if TDR has fired before the IRQ handler.
> Currently, since Panfrost doesn't take any action on these scenarios, the
> job is being leaked, considering that `free_job()` won't be called.
>
> To avoid such leaks, use the DRM_GPU_SCHED_STAT_RUNNING status to skip the
> reset and rearm the timer.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Steven Price <steven.price@arm.com>
> ---
> drivers/gpu/drm/panfrost/panfrost_job.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 5657106c2f7d0a0ca6162850767f58f3200cce13..2948d5c02115544a0e0babffd850f1506152849d 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -751,11 +751,11 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
> int js = panfrost_job_get_slot(job);
>
> /*
> - * If the GPU managed to complete this jobs fence, the timeout is
> - * spurious. Bail out.
> + * If the GPU managed to complete this jobs fence, TDR has fired before
> + * IRQ and the timeout is spurious. Bail out.
> */
> if (dma_fence_is_signaled(job->done_fence))
> - return DRM_GPU_SCHED_STAT_NOMINAL;
> + return DRM_GPU_SCHED_STAT_RUNNING;
>
> /*
> * Panfrost IRQ handler may take a long time to process an interrupt
> @@ -770,7 +770,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>
> if (dma_fence_is_signaled(job->done_fence)) {
> dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
> - return DRM_GPU_SCHED_STAT_NOMINAL;
> + return DRM_GPU_SCHED_STAT_RUNNING;
> }
>
> dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-07 12:50 ` Tvrtko Ursulin
@ 2025-05-12 9:24 ` Philipp Stanner
0 siblings, 0 replies; 34+ messages in thread
From: Philipp Stanner @ 2025-05-12 9:24 UTC (permalink / raw)
To: Tvrtko Ursulin, Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Simona Vetter, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On Wed, 2025-05-07 at 13:50 +0100, Tvrtko Ursulin wrote:
>
> On 07/05/2025 13:33, Maíra Canal wrote:
> > Hi Tvrtko,
> >
> > Thanks for the review!
> >
> > On 06/05/25 08:32, Tvrtko Ursulin wrote:
> > >
> > > On 03/05/2025 21:59, Maíra Canal wrote:
> > > > When the DRM scheduler times out, it's possible that the GPU
> > > > isn't hung;
> > > > instead, a job may still be running, and there may be no valid
> > > > reason to
> > > > reset the hardware. This can occur in two situations:
> > > >
> > > > 1. The GPU exposes some mechanism that ensures the GPU is
> > > > still
> > > > making
> > > > progress. By checking this mechanism, we can safely skip
> > > > the
> > > > reset,
> > > > rearm the timeout, and allow the job to continue running
> > > > until
> > > > completion. This is the case for v3d and Etnaviv.
> > > > 2. TDR has fired before the IRQ that signals the fence.
> > > > Consequently,
> > > > the job actually finishes, but it triggers a timeout
> > > > before
> > > > signaling
> > > > the completion fence.
> > > >
> > > > These two scenarios are problematic because we remove the job
> > > > from the
> > > > `sched->pending_list` before calling `sched->ops->timedout_job()`.
> > > > This
> > > > means that when the job finally signals completion (e.g. in the
> > > > IRQ
> > > > handler), the scheduler won't call `sched->ops->free_job()`. As
> > > > a
> > > > result,
> > > > the job and its resources won't be freed, leading to a memory
> > > > leak.
> > > >
> > > > To resolve this issue, we create a new `drm_gpu_sched_stat`
> > > > that
> > > > allows a
> > > > driver to skip the reset. This new status will indicate that
> > > > the job
> > > > should be reinserted into the pending list, and the driver will
> > > > still
> > > > signal its completion.
> > >
> >
> > [...]
> >
> > > > diff --git a/include/drm/gpu_scheduler.h
> > > > b/include/drm/gpu_scheduler.h
> > > > index
> > > > 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b
> > > > 5fc16b927202a507d51 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -389,11 +389,13 @@ struct drm_sched_job {
> > > > * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > > > * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> > > > * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available
> > > > anymore.
> > > > + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip
> > > > the
> > > > reset.
> > >
> > > s/GPU/job/ ?
> > >
> > > > */
> > > > enum drm_gpu_sched_stat {
> > > > DRM_GPU_SCHED_STAT_NONE,
> > > > DRM_GPU_SCHED_STAT_NOMINAL,
> > > > DRM_GPU_SCHED_STAT_ENODEV,
> > > > + DRM_GPU_SCHED_STAT_RUNNING,
> > >
> > > I am wondering if we could make it more obvious what is the
> > > difference
> > > between "nominal" and "running" and from whose point of view
> > > should
> > > those statuses be considered.
> > > So far we have "nominal" which means scheduler/hardware is working
> > > fine
> > > but the job may or may have not been cancelled. With "running" we
> > > kind
> > > of split it into two sub-statuses and it would be nice for that
> > > to be
> > > intuitively visible from the naming. But I struggle to suggest an
> > > elegant name while preserving nominal as is.
> >
> > I was thinking: how about changing DRM_GPU_SCHED_STAT_NOMINAL to
> > DRM_GPU_SCHED_STAT_RESET (the hardware is fine, but we reset it)?
> >
> > Then, when we skip the reset, we would have
> > DRM_GPU_SCHED_STAT_NOMINAL
> > (which means the hardware is fine and we didn't reset it).
> >
> > I'm open to other suggestions.
>
> DRM_GPU_SCHED_STAT_RESET sounds like a good name and seems to paint a
> consistent story between running - reset - enodev.
>
> > > Thinking out loud here - perhaps that is pointing towards an
> > > alternative that instead of a new status, a new helper to re-
> > > insert
> > > the single job (like drm_sched_resubmit_job(sched, job)) would
> > > fit
> > > better? Although it would be more churn.
> > >
> >
> > Although your solution might be more elegant, I'm worried that such
> > a
> > function could be used improperly by new users (e.g. being called
> > in
> > contexts other than `timedout_job()`).
>
> We could call it drm_sched_untimedout_job(). </humour>
>
> > I'd prefer to have a new status as it'll be use solely for
> > `timedout_job()` (making it harder for users to use it
> > inappropriately).
> > With the addition of Matthew's feedback (calling
> > `drm_sched_run_free_queue()` after adding the job to the pending
> > list),
> > I think it makes even more sense to keep it inside the timeout
> > function.
> >
> > I hope others can chime in and give their opinions about your idea.
>
> Yeah - Philipp - Danilo - what do you prefer? Third enum with or a
> new
> helper?
I'm also afraid that providing yet another helper for this specific
case opens the door to abuse. We had (and still have) issues with the
familiar drm_sched_resubmit_jobs() function. Christian has been very
clear that this was a bad idea, and I'd rather not walk a road that
looks similar to that one.
I tend to think that the status codes are the appropriate mechanism to
address this. They were, after all, invented to inform the scheduler
about what is going on inside the driver.
That said, currently, ENODEV is basically the only error, and
everything other than ENODEV (i.e., NOMINAL) is the "OK state".
A timeout occurring and the GPU not hanging is, therefore, also "OK".
Whatever the name will be, the docu for NOMINAL must also be adjusted.
How about calling it "NORMAL" instead of "NOMINAL", since that state
actually describes what is both OK and "the norm", i.e., most commonly
the case?
And I wouldn't call it RUNNING, since the GPU is also running in
NOMINAL state. "NO_HANG" could hint more effectively at the fact that
the GPU is, contrary to the scheduler's belief, not hanging.
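Put together, the naming ideas floated in this thread would give an enum
roughly like the following (a sketch of the proposals only, not the
current API):

	enum drm_gpu_sched_stat {
		DRM_GPU_SCHED_STAT_NONE,	/* Reserved. Do not use. */
		DRM_GPU_SCHED_STAT_RESET,	/* Hung job, hardware was reset. */
		DRM_GPU_SCHED_STAT_ENODEV,	/* Device is not available anymore. */
		DRM_GPU_SCHED_STAT_NO_HANG,	/* Spurious timeout, reset skipped. */
	};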
(I've been out for a few days and am catching up to a lot of things.
Just had time to get deeper into this series. Apologies if my picture
isn't complete yet)
Thanks,
P.
>
> Regards,
>
> Tvrtko
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-05-06 2:41 ` Matthew Brost
2025-05-06 11:32 ` Tvrtko Ursulin
@ 2025-05-12 11:04 ` Philipp Stanner
2025-05-12 13:59 ` Maíra Canal
2025-05-13 7:26 ` Philipp Stanner
3 siblings, 1 reply; 34+ messages in thread
From: Philipp Stanner @ 2025-05-12 11:04 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Tvrtko Ursulin,
Simona Vetter, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring, Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't
> hung;
> instead, a job may still be running, and there may be no valid reason
> to
> reset the hardware. This can occur in two situations:
>
> 1. The GPU exposes some mechanism that ensures the GPU is still
> making
> progress. By checking this mechanism, we can safely skip the
> reset,
> rearm the timeout, and allow the job to continue running until
> completion. This is the case for v3d and Etnaviv.
Who is "we" and where is the reset skipped? In the timedout_job()
callback?
> 2. TDR has fired before the IRQ that signals the fence.
Any concern about saying "Timeout" instead of "TDR"? I think most of us
aren't familiar with that acronym.
> Consequently,
> the job actually finishes, but it triggers a timeout before
> signaling
> the completion fence.
That formulation doesn't seem correct. Once the timeout fired, the job,
as far as the GPU is concerned, is already finished, isn't it?
What is the "completion fence"? In the scheduler, we call the fence
returned by backend_ops.run_job() the "hardware fence".
And who is the "it" in "it triggers a timeout"? I assume you want to
say "the job has actually finished, but the scheduler triggers a
timeout anyways".
Also the purpose of that list is a bit unclear to me. It seems to be a
list of problems, but point 1 seems valid?
>
> These two scenarios are problematic because we remove the job from
> the
> `sched->pending_list` before calling `sched->ops->timedout_job()`.
Who is "we"? :)
> This
> means that when the job finally signals completion (e.g. in the IRQ
> handler),
A job doesn't signal completion.
The hardware / driver signals job completion by signaling the hardware
fence.
> the scheduler won't call `sched->ops->free_job()`. As a result,
> the job and its resources won't be freed, leading to a memory leak.
OK, I think I get it. But isn't another explanation of the issue that
the driver callback doesn't take care of cleaning up the job that has
timed out (from the scheduler's perspective)?
It's not clear to me that the scheduler actually contains a bug here,
but rather is designed in a way that doesn't consider that some GPUs
have special timeout requirements or, rather, can have bursts of
slowness that don't actually indicate a timeout.
I think the commit message should be very clear about whether this is
an improvement of a design weakness or an actual bug fix.
>
> To resolve this issue, we create a new `drm_gpu_sched_stat` that
> allows a
> driver to skip the reset. This new status will indicate that the job
> should be reinserted into the pending list, and the driver will still
> signal its completion.
Hmm, yes, I think that this is the right way to address that problem.
+1
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> include/drm/gpu_scheduler.h | 2 ++
> 2 files changed, 16 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index
> 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881
> 135dbc639a9b4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct
> work_struct *work)
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + /*
> + * If the driver indicated that the GPU is still
> running and wants to skip
> + * the reset, reinsert the job back into the pending
> list and realarm the
> + * timeout.
> + */
> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> + spin_unlock(&sched->job_list_lock);
> + }
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct
> work_struct *work)
> * This function is typically used for reset recovery (see the docu
> of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it
> for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's used for reset recovery, drm_sched_stop() shouldn't be
> called
> + * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
The same comment then applies to the counterpart, drm_sched_start().
We might also want to look into who uses drm_sched_wqueue_{start,stop}
and consider if they need a comment. Though I don't expect you to do
that. Those functions are hacky legacy anyways.
P.
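For illustration, a counterpart note on drm_sched_start() could be worded
along these lines (a sketch mirroring the comment this patch adds to
drm_sched_stop()):

	 * As with drm_sched_stop(), this function must not be called when the
	 * driver skipped the reset (DRM_GPU_SCHED_STAT_RUNNING) and therefore
	 * never called drm_sched_stop() for this timeout.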
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct
> drm_sched_job *bad)
> {
> diff --git a/include/drm/gpu_scheduler.h
> b/include/drm/gpu_scheduler.h
> index
> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b
> 927202a507d51 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -389,11 +389,13 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available
> anymore.
> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the
> reset.
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_NOMINAL,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_RUNNING,
> };
>
> /**
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-06 14:32 ` Matthew Brost
@ 2025-05-12 11:13 ` Philipp Stanner
2025-05-12 14:04 ` Maíra Canal
0 siblings, 1 reply; 34+ messages in thread
From: Philipp Stanner @ 2025-05-12 11:13 UTC (permalink / raw)
To: Matthew Brost, Maíra Canal
Cc: Danilo Krummrich, Philipp Stanner, Christian König,
Tvrtko Ursulin, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, kernel-dev, dri-devel, etnaviv, intel-xe
On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote:
> On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote:
> > On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
> > > When the DRM scheduler times out, it's possible that the GPU
> > > isn't hung;
> > > instead, a job may still be running, and there may be no valid
> > > reason to
> > > reset the hardware. This can occur in two situations:
> > >
> > > 1. The GPU exposes some mechanism that ensures the GPU is still
> > > making
> > > progress. By checking this mechanism, we can safely skip the
> > > reset,
> > > rearm the timeout, and allow the job to continue running
> > > until
> > > completion. This is the case for v3d and Etnaviv.
> > > 2. TDR has fired before the IRQ that signals the fence.
> > > Consequently,
> > > the job actually finishes, but it triggers a timeout before
> > > signaling
> > > the completion fence.
> > >
> >
> > We have both of these cases in Xe too. We implement the requeuing
> > in Xe
> > via driver side function - xe_sched_add_pending_job but this looks
> > better and will make use of this.
> >
> > > These two scenarios are problematic because we remove the job
> > > from the
> > > `sched->pending_list` before calling `sched->ops->timedout_job()`.
> > > This
> > > means that when the job finally signals completion (e.g. in the
> > > IRQ
> > > handler), the scheduler won't call `sched->ops->free_job()`. As a
> > > result,
> > > the job and its resources won't be freed, leading to a memory
> > > leak.
> > >
> > > To resolve this issue, we create a new `drm_gpu_sched_stat` that
> > > allows a
> > > driver to skip the reset. This new status will indicate that the
> > > job
> > > should be reinserted into the pending list, and the driver will
> > > still
> > > signal its completion.
> > >
> > > Signed-off-by: Maíra Canal <mcanal@igalia.com>
> >
> > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> >
>
> Wait - nevermind I think one issue is below.
>
> > > ---
> > > drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> > > include/drm/gpu_scheduler.h | 2 ++
> > > 2 files changed, 16 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index
> > > 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309
> > > f881135dbc639a9b4 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct
> > > work_struct *work)
> > > job->sched->ops->free_job(job);
> > > sched->free_guilty = false;
> > > }
> > > +
> > > + /*
> > > + * If the driver indicated that the GPU is still
> > > running and wants to skip
> > > + * the reset, reinsert the job back into the
> > > pending list and realarm the
> > > + * timeout.
> > > + */
> > > + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> > > + spin_lock(&sched->job_list_lock);
> > > + list_add(&job->list, &sched->pending_list);
> > > + spin_unlock(&sched->job_list_lock);
> > > + }
>
> I think you need to requeue free_job wq here. It is possible the
> free_job wq ran, didn't find a job, goes to sleep, then we add a
> signaled job here which will never get freed.
I wonder if that could be solved by holding job_list_lock a bit longer.
free_job_work will try to check the list for the next signaled job, but
will wait for the lock.
If that works, we could completely rely on the standard mechanism
without requeuing, which would be neat.
P.
>
> Matt
>
> > > } else {
> > > spin_unlock(&sched->job_list_lock);
> > > }
> > > @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct
> > > work_struct *work)
> > > * This function is typically used for reset recovery (see the
> > > docu of
> > > * drm_sched_backend_ops.timedout_job() for details). Do not
> > > call it for
> > > * scheduler teardown, i.e., before calling drm_sched_fini().
> > > + *
> > > + * As it's used for reset recovery, drm_sched_stop() shouldn't
> > > be called
> > > + * if the scheduler skipped the timeout
> > > (DRM_SCHED_STAT_RUNNING).
> > > */
> > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct
> > > drm_sched_job *bad)
> > > {
> > > diff --git a/include/drm/gpu_scheduler.h
> > > b/include/drm/gpu_scheduler.h
> > > index
> > > 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5f
> > > c16b927202a507d51 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -389,11 +389,13 @@ struct drm_sched_job {
> > > * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > > * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> > > * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available
> > > anymore.
> > > + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip
> > > the reset.
> > > */
> > > enum drm_gpu_sched_stat {
> > > DRM_GPU_SCHED_STAT_NONE,
> > > DRM_GPU_SCHED_STAT_NOMINAL,
> > > DRM_GPU_SCHED_STAT_ENODEV,
> > > + DRM_GPU_SCHED_STAT_RUNNING,
> > > };
> > >
> > > /**
> > >
> > > --
> > > 2.49.0
> > >
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-12 11:04 ` Philipp Stanner
@ 2025-05-12 13:59 ` Maíra Canal
0 siblings, 0 replies; 34+ messages in thread
From: Maíra Canal @ 2025-05-12 13:59 UTC (permalink / raw)
To: phasta, Matthew Brost, Danilo Krummrich, Christian König,
Tvrtko Ursulin, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Philipp,
On 12/05/25 08:04, Philipp Stanner wrote:
> On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote:
>> When the DRM scheduler times out, it's possible that the GPU isn't
>> hung;
>> instead, a job may still be running, and there may be no valid reason
>> to
>> reset the hardware. This can occur in two situations:
>>
>> 1. The GPU exposes some mechanism that ensures the GPU is still
>> making
>> progress. By checking this mechanism, we can safely skip the
>> reset,
>> rearm the timeout, and allow the job to continue running until
>> completion. This is the case for v3d and Etnaviv.
>
> Who is "we" and where is the reset skipped? In the timedout_job()
> callback?
"we" is just a generic way to say "we, the driver developers". The reset
is skipped in the `timedout_job()` callback.
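For illustration, a minimal sketch of such a callback for a driver that
exposes a progress counter (assuming <drm/gpu_scheduler.h>; everything
prefixed with my_ is hypothetical, only the DRM_GPU_SCHED_STAT_* values
come from this series):

struct my_job {
        struct drm_sched_job base;
        struct my_gpu *gpu;     /* hypothetical device handle */
        u32 last_progress;      /* progress counter sampled at the last timeout */
};

static enum drm_gpu_sched_stat my_sched_timedout_job(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);
        u32 progress = my_gpu_read_progress(job->gpu); /* hypothetical MMIO read */

        if (progress != job->last_progress) {
                /* The GPU advanced since the last check: skip the reset. */
                job->last_progress = progress;
                return DRM_GPU_SCHED_STAT_RUNNING;
        }

        /* Genuinely hung: reset (details omitted) and let recovery proceed. */
        my_gpu_reset(job->gpu);
        return DRM_GPU_SCHED_STAT_NOMINAL;
}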
>
>> 2. TDR has fired before the IRQ that signals the fence.
>
> Any concern about saying "Timeout" instead of "TDR"? I think most of us
> aren't familiar with that acronym.
I'll use "Timeout" in v2.
>
>> Consequently,
>> the job actually finishes, but it triggers a timeout before
>> signaling
>> the completion fence.
>
> That formulation doesn't seem correct. Once the timeout fired, the job,
> as far as the GPU is concerned, is already finished, isn't it?
Yes, but there is a race condition between the IRQ handler (which will
indeed signal the fence) and the timeout handler.
>
> What is the "completion fence"? In the scheduler, we call the fence
> returned by backend_ops.run_job() the "hardware fence".
>
> And who is the "it" in "it triggers a timeout"? I assume you want to
> say "the job has actually finished, but the scheduler triggers a
> timeout anyways".
>
I'll address it in v2.
>
> Also the purpose of that list is a bit unclear to me. It seems to be a
> list of problems, but point 1 seems valid?
>
Point 2 is also valid; otherwise Xe, Panfrost, Lima, and Etnaviv
wouldn't have to handle this situation. TDR can "win" the race
condition, and although the hardware fence will be signaled in the IRQ
handler, the job won't be freed.
>
>>
>> These two scenarios are problematic because we remove the job from
>> the
>> `sched->pending_list` before calling `sched->ops->timedout_job()`.
>
> Who is "we"? :)
Ditto.
>
>
>> This
>> means that when the job finally signals completion (e.g. in the IRQ
>> handler),
>
> A job doesn't signal completion.
>
> The hardware / driver signals job completion by signaling the hardware
> fence.
I'll address it in v2.
>
>> the scheduler won't call `sched->ops->free_job()`. As a result,
>> the job and its resources won't be freed, leading to a memory leak.
>
> OK, I think I get it. But isn't another explanation of the issue that
> the driver callback doesn't take care of cleaning up the job that has
> timed out (from the scheduler's perspective)?
>
> It's not clear to me that the scheduler actually contains a bug here,
> but rather is designed in a way that doesn't consider that some GPUs
> have special timeout requirements or, rather, can have bursts of
> slowness that don't actually indicate a timeout.
>
> I think the commit message should be very clear about whether this is
> an improvement of a design weakness or an actual bug fix.
>
>>
>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>> allows a
>> driver to skip the reset. This new status will indicate that the job
>> should be reinserted into the pending list, and the driver will still
>> signal its completion.
>
> Hmm, yes, I think that this is the right way to address that problem.
> +1
>
>
>>
>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
>> include/drm/gpu_scheduler.h | 2 ++
>> 2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index
>> 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881
>> 135dbc639a9b4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct
>> work_struct *work)
>> job->sched->ops->free_job(job);
>> sched->free_guilty = false;
>> }
>> +
>> + /*
>> + * If the driver indicated that the GPU is still
>> running and wants to skip
>> + * the reset, reinsert the job back into the pending
>> list and realarm the
>> + * timeout.
>> + */
>> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
>> + spin_lock(&sched->job_list_lock);
>> + list_add(&job->list, &sched->pending_list);
>> + spin_unlock(&sched->job_list_lock);
>> + }
>> } else {
>> spin_unlock(&sched->job_list_lock);
>> }
>> @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct
>> work_struct *work)
>> * This function is typically used for reset recovery (see the docu
>> of
>> * drm_sched_backend_ops.timedout_job() for details). Do not call it
>> for
>> * scheduler teardown, i.e., before calling drm_sched_fini().
>> + *
>> + * As it's used for reset recovery, drm_sched_stop() shouldn't be
>> called
>> + * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
>
> The same comment then applies to the counterpart, drm_sched_start().
>
Thanks for your review! I'll address all your comments in v2.
Best Regards,
- Maíra
> We might also want to look into who uses drm_sched_wqueue_{start,stop}
> and consider if they need a comment. Though I don't expect you to do
> that. Those functions are hacky legacy anyways.
>
>
> P.
>
>> */
>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct
>> drm_sched_job *bad)
>> {
>> diff --git a/include/drm/gpu_scheduler.h
>> b/include/drm/gpu_scheduler.h
>> index
>> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b
>> 927202a507d51 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -389,11 +389,13 @@ struct drm_sched_job {
>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available
>> anymore.
>> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the
>> reset.
>> */
>> enum drm_gpu_sched_stat {
>> DRM_GPU_SCHED_STAT_NONE,
>> DRM_GPU_SCHED_STAT_NOMINAL,
>> DRM_GPU_SCHED_STAT_ENODEV,
>> + DRM_GPU_SCHED_STAT_RUNNING,
>> };
>>
>> /**
>>
>
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-12 11:13 ` Philipp Stanner
@ 2025-05-12 14:04 ` Maíra Canal
2025-05-12 14:09 ` Philipp Stanner
0 siblings, 1 reply; 34+ messages in thread
From: Maíra Canal @ 2025-05-12 14:04 UTC (permalink / raw)
To: phasta, Matthew Brost
Cc: Danilo Krummrich, Christian König, Tvrtko Ursulin,
Simona Vetter, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring, Steven Price,
kernel-dev, dri-devel, etnaviv, intel-xe
Hi Philipp,
On 12/05/25 08:13, Philipp Stanner wrote:
> On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote:
>> On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote:
>>> On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
>>>> When the DRM scheduler times out, it's possible that the GPU
>>>> isn't hung;
>>>> instead, a job may still be running, and there may be no valid
>>>> reason to
>>>> reset the hardware. This can occur in two situations:
>>>>
>>>> 1. The GPU exposes some mechanism that ensures the GPU is still
>>>> making
>>>> progress. By checking this mechanism, we can safely skip the
>>>> reset,
>>>> rearm the timeout, and allow the job to continue running
>>>> until
>>>> completion. This is the case for v3d and Etnaviv.
>>>> 2. TDR has fired before the IRQ that signals the fence.
>>>> Consequently,
>>>> the job actually finishes, but it triggers a timeout before
>>>> signaling
>>>> the completion fence.
>>>>
>>>
>>> We have both of these cases in Xe too. We implement the requeuing
>>> in Xe
>>> via driver side function - xe_sched_add_pending_job but this looks
>>> better and will make use of this.
>>>
>>>> These two scenarios are problematic because we remove the job
>>>> from the
>>>> `sched->pending_list` before calling `sched->ops-
>>>>> timedout_job()`. This
>>>> means that when the job finally signals completion (e.g. in the
>>>> IRQ
>>>> handler), the scheduler won't call `sched->ops->free_job()`. As a
>>>> result,
>>>> the job and its resources won't be freed, leading to a memory
>>>> leak.
>>>>
>>>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>>>> allows a
>>>> driver to skip the reset. This new status will indicate that the
>>>> job
>>>> should be reinserted into the pending list, and the driver will
>>>> still
>>>> signal its completion.
>>>>
>>>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>>>
>>> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>>>
>>
>> Wait - nevermind I think one issue is below.
>>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
>>>> include/drm/gpu_scheduler.h | 2 ++
>>>> 2 files changed, 16 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index
>>>> 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309
>>>> f881135dbc639a9b4 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct
>>>> work_struct *work)
>>>> job->sched->ops->free_job(job);
>>>> sched->free_guilty = false;
>>>> }
>>>> +
>>>> + /*
>>>> + * If the driver indicated that the GPU is still
>>>> running and wants to skip
>>>> + * the reset, reinsert the job back into the
>>>> pending list and realarm the
>>>> + * timeout.
>>>> + */
>>>> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
>>>> + spin_lock(&sched->job_list_lock);
>>>> + list_add(&job->list, &sched-
>>>>> pending_list);
>>>> + spin_unlock(&sched->job_list_lock);
>>>> + }
>>
>> I think you need to requeue free_job wq here. It is possible the
>> free_job wq ran, didn't find a job, goes to sleep, then we add a
>> signaled job here which will never get freed.
>
> I wonder if that could be solved by holding job_list_lock a bit longer.
> free_job_work will try to check the list for the next signaled job, but
> will wait for the lock.
>
> If that works, we could completely rely on the standard mechanism
> without requeuing, which would be neat.
I believe it works. However, the tradeoff would be holding the lock for
the entire GPU reset (in the cases where the GPU actually hung), which
looks like a lot of time.
Do you think it's reasonable to do so?
Best Regards,
- Maíra
>
> P.
>
>>
>> Matt
>>
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-12 14:04 ` Maíra Canal
@ 2025-05-12 14:09 ` Philipp Stanner
2025-05-12 14:58 ` Philipp Stanner
0 siblings, 1 reply; 34+ messages in thread
From: Philipp Stanner @ 2025-05-12 14:09 UTC (permalink / raw)
To: Maíra Canal, phasta, Matthew Brost
Cc: Danilo Krummrich, Christian König, Tvrtko Ursulin,
Simona Vetter, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring, Steven Price,
kernel-dev, dri-devel, etnaviv, intel-xe
On Mon, 2025-05-12 at 11:04 -0300, Maíra Canal wrote:
> Hi Philipp,
>
> On 12/05/25 08:13, Philipp Stanner wrote:
> > On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote:
> > > On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote:
> > > > On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
> > > > > When the DRM scheduler times out, it's possible that the GPU
> > > > > isn't hung;
> > > > > instead, a job may still be running, and there may be no
> > > > > valid
> > > > > reason to
> > > > > reset the hardware. This can occur in two situations:
> > > > >
> > > > > 1. The GPU exposes some mechanism that ensures the GPU is
> > > > > still
> > > > > making
> > > > > progress. By checking this mechanism, we can safely
> > > > > skip the
> > > > > reset,
> > > > > rearm the timeout, and allow the job to continue
> > > > > running
> > > > > until
> > > > > completion. This is the case for v3d and Etnaviv.
> > > > > 2. TDR has fired before the IRQ that signals the fence.
> > > > > Consequently,
> > > > > the job actually finishes, but it triggers a timeout
> > > > > before
> > > > > signaling
> > > > > the completion fence.
> > > > >
> > > >
> > > > We have both of these cases in Xe too. We implement the
> > > > requeuing
> > > > in Xe
> > > > via driver side function - xe_sched_add_pending_job but this
> > > > looks
> > > > better and will make use of this.
> > > >
> > > > > These two scenarios are problematic because we remove the job
> > > > > from the
> > > > > `sched->pending_list` before calling `sched->ops-
> > > > > > timedout_job()`. This
> > > > > means that when the job finally signals completion (e.g. in
> > > > > the
> > > > > IRQ
> > > > > handler), the scheduler won't call `sched->ops->free_job()`.
> > > > > As a
> > > > > result,
> > > > > the job and its resources won't be freed, leading to a memory
> > > > > leak.
> > > > >
> > > > > To resolve this issue, we create a new `drm_gpu_sched_stat`
> > > > > that
> > > > > allows a
> > > > > driver to skip the reset. This new status will indicate that
> > > > > the
> > > > > job
> > > > > should be reinserted into the pending list, and the driver
> > > > > will
> > > > > still
> > > > > signal its completion.
> > > > >
> > > > > Signed-off-by: Maíra Canal <mcanal@igalia.com>
> > > >
> > > > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> > > >
> > >
> > > Wait - nevermind I think one issue is below.
> > >
> > > > > ---
> > > > > drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> > > > > include/drm/gpu_scheduler.h | 2 ++
> > > > > 2 files changed, 16 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index
> > > > > 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a03
> > > > > 4309
> > > > > f881135dbc639a9b4 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -568,6 +568,17 @@ static void
> > > > > drm_sched_job_timedout(struct
> > > > > work_struct *work)
> > > > > job->sched->ops->free_job(job);
> > > > > sched->free_guilty = false;
> > > > > }
> > > > > +
> > > > > + /*
> > > > > + * If the driver indicated that the GPU is
> > > > > still
> > > > > running and wants to skip
> > > > > + * the reset, reinsert the job back into the
> > > > > pending list and realarm the
> > > > > + * timeout.
> > > > > + */
> > > > > + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> > > > > + spin_lock(&sched->job_list_lock);
> > > > > + list_add(&job->list, &sched-
> > > > > > pending_list);
> > > > > + spin_unlock(&sched->job_list_lock);
> > > > > + }
> > >
> > > I think you need to requeue free_job wq here. It is possible the
> > > free_job wq ran, didn't find a job, goes to sleep, then we add a
> > > signaled job here which will never get freed.
> >
> > I wonder if that could be solved by holding job_list_lock a bit
> > longer.
> > free_job_work will try to check the list for the next signaled job,
> > but
> > will wait for the lock.
> >
> > If that works, we could completely rely on the standard mechanism
> > without requeuing, which would be neat.
>
> I believe it works. However, the tradeoff would be holding the lock
> for
> the entire reset of the GPU (in the cases the GPU actually hanged),
> which looks like a lot of time.
>
> Do you think it's reasonable to do so?
The scheduler only has three distinct work items: run_job, free_job and
timeout.
timeout runs only serially, so that's not relevant; and run_job() and
free_job() should be halted in the timeout handler through
drm_sched_stop() anyway.
Moreover, timeouts should be rare events.
So I'd say yes, the clarity of the code trumps here.
Cheers,
P.
>
> Best Regards,
> - Maíra
>
> >
> > P.
> >
> > >
> > > Matt
> > >
>
>
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-12 14:09 ` Philipp Stanner
@ 2025-05-12 14:58 ` Philipp Stanner
0 siblings, 0 replies; 34+ messages in thread
From: Philipp Stanner @ 2025-05-12 14:58 UTC (permalink / raw)
To: phasta, Maíra Canal, Matthew Brost
Cc: Danilo Krummrich, Christian König, Tvrtko Ursulin,
Simona Vetter, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring, Steven Price,
kernel-dev, dri-devel, etnaviv, intel-xe
On Mon, 2025-05-12 at 16:09 +0200, Philipp Stanner wrote:
> On Mon, 2025-05-12 at 11:04 -0300, Maíra Canal wrote:
> > Hi Philipp,
> >
> > On 12/05/25 08:13, Philipp Stanner wrote:
> > > On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote:
> > > > On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote:
> > > > > On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
> > > > > > When the DRM scheduler times out, it's possible that the
> > > > > > GPU
> > > > > > isn't hung;
> > > > > > instead, a job may still be running, and there may be no
> > > > > > valid
> > > > > > reason to
> > > > > > reset the hardware. This can occur in two situations:
> > > > > >
> > > > > > 1. The GPU exposes some mechanism that ensures the GPU
> > > > > > is
> > > > > > still
> > > > > > making
> > > > > > progress. By checking this mechanism, we can safely
> > > > > > skip the
> > > > > > reset,
> > > > > > rearm the timeout, and allow the job to continue
> > > > > > running
> > > > > > until
> > > > > > completion. This is the case for v3d and Etnaviv.
> > > > > > 2. TDR has fired before the IRQ that signals the fence.
> > > > > > Consequently,
> > > > > > the job actually finishes, but it triggers a timeout
> > > > > > before
> > > > > > signaling
> > > > > > the completion fence.
> > > > > >
> > > > >
> > > > > We have both of these cases in Xe too. We implement the
> > > > > requeuing
> > > > > in Xe
> > > > > via driver side function - xe_sched_add_pending_job but this
> > > > > looks
> > > > > better and will make use of this.
> > > > >
> > > > > > These two scenarios are problematic because we remove the
> > > > > > job
> > > > > > from the
> > > > > > `sched->pending_list` before calling `sched->ops-
> > > > > > > timedout_job()`. This
> > > > > > means that when the job finally signals completion (e.g. in
> > > > > > the
> > > > > > IRQ
> > > > > > handler), the scheduler won't call `sched->ops-
> > > > > > >free_job()`.
> > > > > > As a
> > > > > > result,
> > > > > > the job and its resources won't be freed, leading to a
> > > > > > memory
> > > > > > leak.
> > > > > >
> > > > > > To resolve this issue, we create a new `drm_gpu_sched_stat`
> > > > > > that
> > > > > > allows a
> > > > > > driver to skip the reset. This new status will indicate
> > > > > > that
> > > > > > the
> > > > > > job
> > > > > > should be reinserted into the pending list, and the driver
> > > > > > will
> > > > > > still
> > > > > > signal its completion.
> > > > > >
> > > > > > Signed-off-by: Maíra Canal <mcanal@igalia.com>
> > > > >
> > > > > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> > > > >
> > > >
> > > > Wait - nevermind I think one issue is below.
> > > >
> > > > > > ---
> > > > > > drivers/gpu/drm/scheduler/sched_main.c | 14
> > > > > > ++++++++++++++
> > > > > > include/drm/gpu_scheduler.h | 2 ++
> > > > > > 2 files changed, 16 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index
> > > > > > 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a
> > > > > > 03
> > > > > > 4309
> > > > > > f881135dbc639a9b4 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -568,6 +568,17 @@ static void
> > > > > > drm_sched_job_timedout(struct
> > > > > > work_struct *work)
> > > > > > job->sched->ops->free_job(job);
> > > > > > sched->free_guilty = false;
> > > > > > }
> > > > > > +
> > > > > > + /*
> > > > > > + * If the driver indicated that the GPU is
> > > > > > still
> > > > > > running and wants to skip
> > > > > > + * the reset, reinsert the job back into
> > > > > > the
> > > > > > pending list and realarm the
> > > > > > + * timeout.
> > > > > > + */
> > > > > > + if (status == DRM_GPU_SCHED_STAT_RUNNING)
> > > > > > {
> > > > > > + spin_lock(&sched->job_list_lock);
> > > > > > + list_add(&job->list, &sched-
> > > > > > > pending_list);
> > > > > > + spin_unlock(&sched-
> > > > > > >job_list_lock);
> > > > > > + }
> > > >
> > > > I think you need to requeue free_job wq here. It is possible
> > > > the
> > > > free_job wq ran, didn't find a job, goes to sleep, then we add
> > > > a
> > > > signaled job here which will never get freed.
> > >
> > > I wonder if that could be solved by holding job_list_lock a bit
> > > longer.
> > > free_job_work will try to check the list for the next signaled
> > > job,
> > > but
> > > will wait for the lock.
> > >
> > > If that works, we could completely rely on the standard mechanism
> > > without requeuing, which would be neat.
> >
> > I believe it works. However, the tradeoff would be holding the lock
> > for
> > the entire reset of the GPU (in the cases the GPU actually hanged),
> > which looks like a lot of time.
> >
> > Do you think it's reasonable to do so?
>
> The scheduler only has three distinct work items, run_job, free_job
> and
> timeout.
>
> timeout runs only serially, so that's not relevant; and run_job() and
> free_job() should be halted in the timeout handler through
> drm_sched_stop() anyways.
>
> Moreover, timeouts should be rare events.
>
> So I'd say yes, the clarity of the code trumps here.
Forget about that; it turns out to be an insane idea. Thanks to Danilo
for the feedback.
Many drivers take locks in timedout_job(), for example
pvr_queue_timedout_job().
So that's an insane deadlock danger. Not good.
So it seems the only option left is Matt's idea?
But that certainly needs an explanatory comment, too.
P.
>
> Cheers,
> P.
>
>
> >
> > Best Regards,
> > - Maíra
> >
> > >
> > > P.
> > >
> > > >
> > > > Matt
> > > >
> >
> >
>
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
` (2 preceding siblings ...)
2025-05-12 11:04 ` Philipp Stanner
@ 2025-05-13 7:26 ` Philipp Stanner
2025-05-24 13:33 ` Maíra Canal
3 siblings, 1 reply; 34+ messages in thread
From: Philipp Stanner @ 2025-05-13 7:26 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Tvrtko Ursulin,
Simona Vetter, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring, Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't
> hung;
> instead, a job may still be running, and there may be no valid reason
> to
> reset the hardware. This can occur in two situations:
>
> 1. The GPU exposes some mechanism that ensures the GPU is still
> making
> progress. By checking this mechanism, we can safely skip the
> reset,
> rearm the timeout, and allow the job to continue running until
> completion. This is the case for v3d and Etnaviv.
> 2. TDR has fired before the IRQ that signals the fence.
> Consequently,
> the job actually finishes, but it triggers a timeout before
> signaling
> the completion fence.
>
> These two scenarios are problematic because we remove the job from
> the
> `sched->pending_list` before calling `sched->ops->timedout_job()`.
> This
> means that when the job finally signals completion (e.g. in the IRQ
> handler), the scheduler won't call `sched->ops->free_job()`. As a
> result,
> the job and its resources won't be freed, leading to a memory leak.
We have discussed this and discovered another, related issue. See
below.
>
> To resolve this issue, we create a new `drm_gpu_sched_stat` that
> allows a
> driver to skip the reset. This new status will indicate that the job
> should be reinserted into the pending list, and the driver will still
> signal its completion.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
> include/drm/gpu_scheduler.h | 2 ++
> 2 files changed, 16 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index
> 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881
> 135dbc639a9b4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct
> work_struct *work)
So, the fundamental design problem we have is that the scheduler
assumes that when a timeout occurs, the GPU is completely hung. Your
patch addresses another aspect of that very problem.
But if the GPU is not hung, it can signal the hardware fence at any
moment. So that's racy.
It could, theoretically, lead to backend_ops.timedout_job() being
called with a signaled job, i.e., a job that is not really timed out.
Would you say this is *the same* issue you're describing, or a separate
one? It seems to me that it's a separate one.
Anyways. What I propose is that we wait until your series here has been
merged. Once that's done, we should document that drivers should expect
that backend_ops.timedout_job() can get called with a job that has not
actually timed out, and tell the scheduler about it through
DRM_GPU_SCHED_STAT_NOT_HANGING. Then the scheduler reverts the
timeout's actions, as you propose here.
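As a rough sketch (wording mine, not from the series), such a note in
the drm_sched_backend_ops.timedout_job() kernel-doc could read roughly:

 * Note that the timeout can race with the hardware fence being signaled,
 * so this callback may run for a job that has already finished or that
 * is still making progress. In that case, skip the reset and return the
 * dedicated status (DRM_GPU_SCHED_STAT_RUNNING in this series) so the
 * scheduler reinserts the job into the pending list and rearms the
 * timeout instead of treating the GPU as hung.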
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + /*
> + * If the driver indicated that the GPU is still
> running and wants to skip
> + * the reset, reinsert the job back into the pending
> list and realarm the
> + * timeout.
> + */
> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> + spin_unlock(&sched->job_list_lock);
> + }
btw, if you go for Matt's requeue-work-item approach, it'll be better
to write a helper function with a clear name for all that.
drm_sched_job_reinsert_on_false_timeout(), maybe.
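A rough sketch of what that helper could look like, folding in Matt's
free_job requeue; the work_free_job and submit_wq names are assumptions
based on the scheduler's work-item design and may not match what v2
ends up using:

static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
                                                    struct drm_sched_job *job)
{
        spin_lock(&sched->job_list_lock);
        list_add(&job->list, &sched->pending_list);
        spin_unlock(&sched->job_list_lock);

        /*
         * The job may have signaled while it was off the pending list and
         * the free_job worker may already have gone back to sleep, so kick
         * it again to make sure an already-signaled job still gets freed.
         */
        queue_work(sched->submit_wq, &sched->work_free_job);
}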
P.
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct
> work_struct *work)
> * This function is typically used for reset recovery (see the docu
> of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it
> for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's used for reset recovery, drm_sched_stop() shouldn't be
> called
> + * if the scheduler skipped the timeout (DRM_SCHED_STAT_RUNNING).
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct
> drm_sched_job *bad)
> {
> diff --git a/include/drm/gpu_scheduler.h
> b/include/drm/gpu_scheduler.h
> index
> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b
> 927202a507d51 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -389,11 +389,13 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available
> anymore.
> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the
> reset.
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_NOMINAL,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_RUNNING,
> };
>
> /**
>
* Re: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-05-13 7:26 ` Philipp Stanner
@ 2025-05-24 13:33 ` Maíra Canal
0 siblings, 0 replies; 34+ messages in thread
From: Maíra Canal @ 2025-05-24 13:33 UTC (permalink / raw)
To: phasta, Matthew Brost, Danilo Krummrich, Christian König,
Tvrtko Ursulin, Simona Vetter, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Philipp,
Sorry, I was OoO for a couple of weeks.
On 13/05/25 04:26, Philipp Stanner wrote:
> On Sat, 2025-05-03 at 17:59 -0300, Maíra Canal wrote:
>> When the DRM scheduler times out, it's possible that the GPU isn't
>> hung;
>> instead, a job may still be running, and there may be no valid reason
>> to
>> reset the hardware. This can occur in two situations:
>>
>> 1. The GPU exposes some mechanism that ensures the GPU is still
>> making
>> progress. By checking this mechanism, we can safely skip the
>> reset,
>> rearm the timeout, and allow the job to continue running until
>> completion. This is the case for v3d and Etnaviv.
>> 2. TDR has fired before the IRQ that signals the fence.
>> Consequently,
>> the job actually finishes, but it triggers a timeout before
>> signaling
>> the completion fence.
>>
>> These two scenarios are problematic because we remove the job from
>> the
>> `sched->pending_list` before calling `sched->ops->timedout_job()`.
>> This
>> means that when the job finally signals completion (e.g. in the IRQ
>> handler), the scheduler won't call `sched->ops->free_job()`. As a
>> result,
>> the job and its resources won't be freed, leading to a memory leak.
>
> We have discussed this and discovered another, related issue. See
> below.
>
>>
>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>> allows a
>> driver to skip the reset. This new status will indicate that the job
>> should be reinserted into the pending list, and the driver will still
>> signal its completion.
>>
>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
>> include/drm/gpu_scheduler.h | 2 ++
>> 2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index
>> 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881
>> 135dbc639a9b4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct
>> work_struct *work)
>
> So, the fundamental design problem we have is that the scheduler
> assumes that when a timeout occurs, the GPU is completely hung. Your
> patch addresses another aspect of that very problem.
>
> But if the GPU is not hung, it can signal the hardware fence at any
> moment. So that's racy.
Unfortunately, this already happens, which would be the second point of
the list in the commit message.
>
> It could, theoretically, lead to backend_ops.timedout_job() being
> called with a signaled job, i.e., a job that is not really timed out.
>
> Would you say this is *the same* issue you're describing, or a separate
> one? It seems to me that it's a separate one.
It isn't the issue that I'm describing, in the sense that the scheduler
itself won't do anything to address it. However, several drivers
already handle this situation internally by checking the result of
`dma_fence_is_signaled()` and bailing out of the timeout if the job is
signaled. We would provide a proper status code for those drivers and
avoid memory leaks.
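For reference, a minimal sketch of that bail-out pattern (assuming
<drm/gpu_scheduler.h> and <linux/dma-fence.h>; the reset path is
hypothetical, and s_fence->parent is the hardware fence returned by
run_job(), though drivers often check their own fence pointer instead):

static enum drm_gpu_sched_stat my_driver_timedout_job(struct drm_sched_job *sched_job)
{
        /* The hardware fence already signaled: the timeout lost the race. */
        if (dma_fence_is_signaled(sched_job->s_fence->parent))
                return DRM_GPU_SCHED_STAT_RUNNING;

        my_driver_reset_gpu(sched_job); /* hypothetical reset path */
        return DRM_GPU_SCHED_STAT_NOMINAL;
}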
>
> Anyways. What I propose is that we wait until your series here has been
> merged. Once that's done, we should document that drivers should expect
> that backend_ops.timedout_job() can get called with a job that has not
> actually timed out, and tell the scheduler about it through
> DRM_GPU_SCHED_STAT_NOT_HANGING. Then the scheduler reverts the
> timeout's actions, as you propose here.
>
>
>> job->sched->ops->free_job(job);
>> sched->free_guilty = false;
>> }
>> +
>> + /*
>> + * If the driver indicated that the GPU is still
>> running and wants to skip
>> + * the reset, reinsert the job back into the pending
>> list and realarm the
>> + * timeout.
>> + */
>> + if (status == DRM_GPU_SCHED_STAT_RUNNING) {
>> + spin_lock(&sched->job_list_lock);
>> + list_add(&job->list, &sched->pending_list);
>> + spin_unlock(&sched->job_list_lock);
>> + }
>
> btw, if you go for Matt's requeue work item approach, it'll be better
> to write a helper function with a clear name for all that.
>
> drm_sched_job_reinsert_on_false_timout() maybe.
Thanks for the review, I'll send v2 soon.
Best Regards,
- Maíra
>
>
> P.
>
>
>> } else {
>> spin_unlock(&sched->job_list_lock);
>> }
Thread overview: 34+ messages
2025-05-03 20:59 [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
2025-05-03 20:59 ` [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-05-06 2:41 ` Matthew Brost
2025-05-06 14:32 ` Matthew Brost
2025-05-12 11:13 ` Philipp Stanner
2025-05-12 14:04 ` Maíra Canal
2025-05-12 14:09 ` Philipp Stanner
2025-05-12 14:58 ` Philipp Stanner
2025-05-06 11:32 ` Tvrtko Ursulin
2025-05-07 12:33 ` Maíra Canal
2025-05-07 12:50 ` Tvrtko Ursulin
2025-05-12 9:24 ` Philipp Stanner
2025-05-12 11:04 ` Philipp Stanner
2025-05-12 13:59 ` Maíra Canal
2025-05-13 7:26 ` Philipp Stanner
2025-05-24 13:33 ` Maíra Canal
2025-05-03 20:59 ` [PATCH 2/8] drm/sched: Always free the job after the timeout Maíra Canal
2025-05-06 11:49 ` Tvrtko Ursulin
2025-05-06 12:46 ` Maíra Canal
2025-05-06 13:28 ` Tvrtko Ursulin
2025-05-06 13:38 ` Maíra Canal
2025-05-06 14:18 ` Tvrtko Ursulin
2025-05-03 20:59 ` [PATCH 3/8] drm/sched: Reduce scheduler's timeout for timeout tests Maíra Canal
2025-05-06 12:03 ` Tvrtko Ursulin
2025-05-06 12:56 ` Maíra Canal
2025-05-06 13:20 ` Tvrtko Ursulin
2025-05-03 20:59 ` [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING Maíra Canal
2025-05-06 13:58 ` Tvrtko Ursulin
2025-05-03 20:59 ` [PATCH 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset Maíra Canal
2025-05-03 20:59 ` [PATCH 6/8] drm/etnaviv: " Maíra Canal
2025-05-03 20:59 ` [PATCH 7/8] drm/xe: " Maíra Canal
2025-05-06 4:25 ` Matthew Brost
2025-05-03 20:59 ` [PATCH 8/8] drm/panfrost: " Maíra Canal
2025-05-08 15:04 ` Steven Price