* [PATCH 0/6] drm/sched: Fix memory leaks with cancel_job() callback
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
Changes since the RFC:
- Rename helper function for drm_sched_fini(). (Tvrtko)
- Add Link to Tvrtko's RFC patch to patch 1.
For a long time now, drm_sched_fini() has been able to leak jobs that are still in
drm_sched.pending_list.
This series solves the leaks in a backwards-compatible manner by adding
a new, optional callback. If that callback is implemented, the scheduler
uses it to cancel all jobs from pending_list and then frees them.
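For illustration, a driver-side implementation of such a callback could
look roughly like the following sketch. The "mydrv" naming, the
to_mydrv_job() helper and the hw_fence member are made up; the contract
itself, signal the hardware fence with -ECANCELED and do not free the
job, is the one patch 1 documents:

	static void mydrv_cancel_job(struct drm_sched_job *sched_job)
	{
		struct mydrv_job *job = to_mydrv_job(sched_job);

		/*
		 * Unblock everybody waiting on the job by signaling its
		 * hardware fence with an error. The job must not be freed
		 * here; the scheduler still calls free_job() for it.
		 */
		dma_fence_set_error(job->hw_fence, -ECANCELED);
		dma_fence_signal(job->hw_fence);
	}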
Philipp Stanner (6):
drm/sched: Avoid memory leaks with cancel_job() callback
drm/sched/tests: Port to cancel_job()
drm/sched: Warn if pending list is not empty
drm/nouveau: Make fence container helper usable driver-wide
drm/nouveau: Add new callback for scheduler teardown
drm/nouveau: Remove waitqueue for sched teardown
drivers/gpu/drm/nouveau/nouveau_fence.c | 35 ++++++----
drivers/gpu/drm/nouveau/nouveau_fence.h | 7 ++
drivers/gpu/drm/nouveau/nouveau_sched.c | 35 ++++++----
drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +--
drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 +--
drivers/gpu/drm/scheduler/sched_main.c | 37 +++++++----
.../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
include/drm/gpu_scheduler.h | 18 +++++
8 files changed, 122 insertions(+), 93 deletions(-)
--
2.49.0
* [PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
Since its inception, the GPU scheduler has been able to leak memory if
the driver calls drm_sched_fini() while there are still jobs in flight.
The simplest way to solve this in a backwards compatible manner is by
adding a new callback, drm_sched_backend_ops.cancel_job(), which
instructs the driver to signal the hardware fence associated with the
job. Afterwards, the scheduler can safely use the established free_job()
callback for freeing the job.
Implement the new backend_ops callback cancel_job().
Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
include/drm/gpu_scheduler.h | 18 ++++++++++++++
2 files changed, 39 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c63543132f9d..1239954f5f7c 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1353,6 +1353,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
}
EXPORT_SYMBOL(drm_sched_init);
+static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
+{
+ struct drm_sched_job *job, *tmp;
+
+ /* All other accessors are stopped. No locking necessary. */
+ list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
+ sched->ops->cancel_job(job);
+ list_del(&job->list);
+ sched->ops->free_job(job);
+ }
+}
+
/**
* drm_sched_fini - Destroy a gpu scheduler
*
@@ -1360,19 +1372,11 @@ EXPORT_SYMBOL(drm_sched_init);
*
* Tears down and cleans up the scheduler.
*
- * This stops submission of new jobs to the hardware through
- * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
- * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
- * There is no solution for this currently. Thus, it is up to the driver to make
- * sure that:
- *
- * a) drm_sched_fini() is only called after for all submitted jobs
- * drm_sched_backend_ops.free_job() has been called or that
- * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
- * after drm_sched_fini() ran are freed manually.
- *
- * FIXME: Take care of the above problem and prevent this function from leaking
- * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
+ * This stops submission of new jobs to the hardware through &struct
+ * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
+ * is implemented, all jobs will be canceled through it and afterwards cleaned
+ * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
+ * implemented, memory could leak.
*/
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
@@ -1402,6 +1406,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
/* Confirm no work left behind accessing device structures */
cancel_delayed_work_sync(&sched->work_tdr);
+ /* Avoid memory leaks if supported by the driver. */
+ if (sched->ops->cancel_job)
+ drm_sched_cancel_remaining_jobs(sched);
+
if (sched->own_submit_wq)
destroy_workqueue(sched->submit_wq);
sched->ready = false;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e62a7214e052..190844370f48 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -512,6 +512,24 @@ struct drm_sched_backend_ops {
* and it's time to clean it up.
*/
void (*free_job)(struct drm_sched_job *sched_job);
+
+ /**
+ * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
+ * get signaled in drm_sched_fini().
+ *
+ * Used by the scheduler to cancel all jobs that have not been executed
+ * with &struct drm_sched_backend_ops.run_job by the time
+ * drm_sched_fini() gets invoked.
+ *
+ * Drivers need to signal the passed job's hardware fence with an
+ * appropriate error code (e.g., -ECANCELED) in this callback. They
+ * must not free the job.
+ *
+ * The scheduler will only call this callback once it stopped calling
+ * all other callbacks forever, with the exception of &struct
+ * drm_sched_backend_ops.free_job.
+ */
+ void (*cancel_job)(struct drm_sched_job *sched_job);
};
/**
--
2.49.0
* [PATCH 2/6] drm/sched/tests: Port to cancel_job()
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
The GPU Scheduler now supports a new callback, cancel_job(), which lets
the scheduler cancel all jobs which might not yet be freed when
drm_sched_fini() runs. Using this callback allows for significantly
simplifying the mock scheduler teardown code.
Implement the cancel_job() callback and adjust the code where necessary.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
.../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
1 file changed, 23 insertions(+), 43 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index 49d067fecd67..2d3169d95200 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
lockdep_assert_held(&sched->lock);
job->flags |= DRM_MOCK_SCHED_JOB_DONE;
- list_move_tail(&job->link, &sched->done_list);
+ list_del(&job->link);
dma_fence_signal_locked(&job->hw_fence);
complete(&job->done);
}
@@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
static void mock_sched_free_job(struct drm_sched_job *sched_job)
{
- struct drm_mock_scheduler *sched =
- drm_sched_to_mock_sched(sched_job->sched);
struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
- unsigned long flags;
- /* Remove from the scheduler done list. */
- spin_lock_irqsave(&sched->lock, flags);
- list_del(&job->link);
- spin_unlock_irqrestore(&sched->lock, flags);
dma_fence_put(&job->hw_fence);
-
drm_sched_job_cleanup(sched_job);
/* Mock job itself is freed by the kunit framework. */
}
+static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+ struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
+ struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
+ unsigned long flags;
+
+ hrtimer_cancel(&job->timer);
+
+ spin_lock_irqsave(&sched->lock, flags);
+ if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
+ list_del(&job->link);
+ dma_fence_set_error(&job->hw_fence, -ECANCELED);
+ dma_fence_signal_locked(&job->hw_fence);
+ }
+ spin_unlock_irqrestore(&sched->lock, flags);
+
+ /* The GPU Scheduler will call drm_sched_backend_ops.free_job(), still.
+ * Mock job itself is freed by the kunit framework. */
+}
+
static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
.run_job = mock_sched_run_job,
.timedout_job = mock_sched_timedout_job,
- .free_job = mock_sched_free_job
+ .free_job = mock_sched_free_job,
+ .cancel_job = mock_sched_cancel_job,
};
/**
@@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
sched->hw_timeline.context = dma_fence_context_alloc(1);
atomic_set(&sched->hw_timeline.next_seqno, 0);
INIT_LIST_HEAD(&sched->job_list);
- INIT_LIST_HEAD(&sched->done_list);
spin_lock_init(&sched->lock);
return sched;
@@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
*/
void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
{
- struct drm_mock_sched_job *job, *next;
- unsigned long flags;
- LIST_HEAD(list);
-
- drm_sched_wqueue_stop(&sched->base);
-
- /* Force complete all unfinished jobs. */
- spin_lock_irqsave(&sched->lock, flags);
- list_for_each_entry_safe(job, next, &sched->job_list, link)
- list_move_tail(&job->link, &list);
- spin_unlock_irqrestore(&sched->lock, flags);
-
- list_for_each_entry(job, &list, link)
- hrtimer_cancel(&job->timer);
-
- spin_lock_irqsave(&sched->lock, flags);
- list_for_each_entry_safe(job, next, &list, link)
- drm_mock_sched_job_complete(job);
- spin_unlock_irqrestore(&sched->lock, flags);
-
- /*
- * Free completed jobs and jobs not yet processed by the DRM scheduler
- * free worker.
- */
- spin_lock_irqsave(&sched->lock, flags);
- list_for_each_entry_safe(job, next, &sched->done_list, link)
- list_move_tail(&job->link, &list);
- spin_unlock_irqrestore(&sched->lock, flags);
-
- list_for_each_entry_safe(job, next, &list, link)
- mock_sched_free_job(&job->base);
-
drm_sched_fini(&sched->base);
}
--
2.49.0
* [PATCH 3/6] drm/sched: Warn if pending list is not empty
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
drm_sched_fini() can leak jobs under certain circumstances.
Warn if that happens.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/sched_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1239954f5f7c..dadf1a22ddf6 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1415,6 +1415,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
sched->ready = false;
kfree(sched->sched_rq);
sched->sched_rq = NULL;
+
+ if (!list_empty(&sched->pending_list))
+ dev_err(sched->dev, "Tearing down scheduler while jobs are pending!\n");
}
EXPORT_SYMBOL(drm_sched_fini);
--
2.49.0
* [PATCH 4/6] drm/nouveau: Make fence container helper usable driver-wide
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
In order to implement a new DRM GPU scheduler callback in Nouveau, a
helper for obtaining a nouveau_fence from a dma_fence is necessary. Such
a helper already exists inside nouveau_fence.c, called from_fence().
Make that helper available to other C files under a more precise name.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++++++-------------
drivers/gpu/drm/nouveau/nouveau_fence.h | 6 ++++++
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index d5654e26d5bc..869d4335c0f4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -38,12 +38,6 @@
static const struct dma_fence_ops nouveau_fence_ops_uevent;
static const struct dma_fence_ops nouveau_fence_ops_legacy;
-static inline struct nouveau_fence *
-from_fence(struct dma_fence *fence)
-{
- return container_of(fence, struct nouveau_fence, base);
-}
-
static inline struct nouveau_fence_chan *
nouveau_fctx(struct nouveau_fence *fence)
{
@@ -77,7 +71,7 @@ nouveau_local_fence(struct dma_fence *fence, struct nouveau_drm *drm)
fence->ops != &nouveau_fence_ops_uevent)
return NULL;
- return from_fence(fence);
+ return to_nouveau_fence(fence);
}
void
@@ -268,7 +262,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
static long
nouveau_fence_wait_legacy(struct dma_fence *f, bool intr, long wait)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
unsigned long sleep_time = NSEC_PER_MSEC / 1000;
unsigned long t = jiffies, timeout = t + wait;
@@ -448,7 +442,7 @@ static const char *nouveau_fence_get_get_driver_name(struct dma_fence *fence)
static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
return !fctx->dead ? fctx->name : "dead channel";
@@ -462,7 +456,7 @@ static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
*/
static bool nouveau_fence_is_signaled(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
struct nouveau_channel *chan;
bool ret = false;
@@ -478,7 +472,7 @@ static bool nouveau_fence_is_signaled(struct dma_fence *f)
static bool nouveau_fence_no_signaling(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
/*
* caller should have a reference on the fence,
@@ -503,7 +497,7 @@ static bool nouveau_fence_no_signaling(struct dma_fence *f)
static void nouveau_fence_release(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
kref_put(&fctx->fence_ref, nouveau_fence_context_put);
@@ -521,7 +515,7 @@ static const struct dma_fence_ops nouveau_fence_ops_legacy = {
static bool nouveau_fence_enable_signaling(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
bool ret;
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 6a983dd9f7b9..183dd43ecfff 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -17,6 +17,12 @@ struct nouveau_fence {
unsigned long timeout;
};
+static inline struct nouveau_fence *
+to_nouveau_fence(struct dma_fence *fence)
+{
+ return container_of(fence, struct nouveau_fence, base);
+}
+
int nouveau_fence_create(struct nouveau_fence **, struct nouveau_channel *);
int nouveau_fence_new(struct nouveau_fence **, struct nouveau_channel *);
void nouveau_fence_unref(struct nouveau_fence **);
--
2.49.0
* [PATCH 5/6] drm/nouveau: Add new callback for scheduler teardown
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
There is a new callback with which the scheduler can always be torn down
in a leak-free, deadlock-free manner.
Port Nouveau as its first user by providing the scheduler with a
callback that ensures the fence context gets killed in drm_sched_fini().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 15 +++++++++++++++
drivers/gpu/drm/nouveau/nouveau_fence.h | 1 +
drivers/gpu/drm/nouveau/nouveau_sched.c | 15 ++++++++++++++-
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 869d4335c0f4..9f345a008717 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -240,6 +240,21 @@ nouveau_fence_emit(struct nouveau_fence *fence)
return ret;
}
+void
+nouveau_fence_cancel(struct nouveau_fence *fence)
+{
+ struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
+ unsigned long flags;
+
+ spin_lock_irqsave(&fctx->lock, flags);
+ if (!dma_fence_is_signaled_locked(&fence->base)) {
+ dma_fence_set_error(&fence->base, -ECANCELED);
+ if (nouveau_fence_signal(fence))
+ nvif_event_block(&fctx->event);
+ }
+ spin_unlock_irqrestore(&fctx->lock, flags);
+}
+
bool
nouveau_fence_done(struct nouveau_fence *fence)
{
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 183dd43ecfff..9957a919bd38 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -29,6 +29,7 @@ void nouveau_fence_unref(struct nouveau_fence **);
int nouveau_fence_emit(struct nouveau_fence *);
bool nouveau_fence_done(struct nouveau_fence *);
+void nouveau_fence_cancel(struct nouveau_fence *fence);
int nouveau_fence_wait(struct nouveau_fence *, bool lazy, bool intr);
int nouveau_fence_sync(struct nouveau_bo *, struct nouveau_channel *, bool exclusive, bool intr);
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 460a5fb02412..2ec62059c351 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -11,6 +11,7 @@
#include "nouveau_exec.h"
#include "nouveau_abi16.h"
#include "nouveau_sched.h"
+#include "nouveau_chan.h"
#define NOUVEAU_SCHED_JOB_TIMEOUT_MS 10000
@@ -393,10 +394,23 @@ nouveau_sched_free_job(struct drm_sched_job *sched_job)
nouveau_job_fini(job);
}
+static void
+nouveau_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+ struct nouveau_fence *fence;
+ struct nouveau_job *job;
+
+ job = to_nouveau_job(sched_job);
+ fence = to_nouveau_fence(job->done_fence);
+
+ nouveau_fence_cancel(fence);
+}
+
static const struct drm_sched_backend_ops nouveau_sched_ops = {
.run_job = nouveau_sched_run_job,
.timedout_job = nouveau_sched_timedout_job,
.free_job = nouveau_sched_free_job,
+ .cancel_job = nouveau_sched_cancel_job,
};
static int
@@ -482,7 +496,6 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
return 0;
}
-
static void
nouveau_sched_fini(struct nouveau_sched *sched)
{
--
2.49.0
* [PATCH 6/6] drm/nouveau: Remove waitqueue for sched teardown
From: Philipp Stanner @ 2025-07-01 13:21 UTC
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
struct nouveau_sched contains a waitqueue needed to prevent
drm_sched_fini() from being called while there are still jobs pending.
Calling it in that state would, so far, have caused memory leaks.
With the new memleak-free mode of operation, switched on in
drm_sched_fini() by providing the callback nouveau_sched_cancel_job(),
the waitqueue is not necessary anymore.
Remove the waitqueue.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++++++-------------
drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++------
drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 ++++----
3 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 2ec62059c351..7d9c3418e76b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -122,11 +122,9 @@ nouveau_job_done(struct nouveau_job *job)
{
struct nouveau_sched *sched = job->sched;
- spin_lock(&sched->job.list.lock);
+ spin_lock(&sched->job_list.lock);
list_del(&job->entry);
- spin_unlock(&sched->job.list.lock);
-
- wake_up(&sched->job.wq);
+ spin_unlock(&sched->job_list.lock);
}
void
@@ -307,9 +305,9 @@ nouveau_job_submit(struct nouveau_job *job)
}
/* Submit was successful; add the job to the schedulers job list. */
- spin_lock(&sched->job.list.lock);
- list_add(&job->entry, &sched->job.list.head);
- spin_unlock(&sched->job.list.lock);
+ spin_lock(&sched->job_list.lock);
+ list_add(&job->entry, &sched->job_list.head);
+ spin_unlock(&sched->job_list.lock);
drm_sched_job_arm(&job->base);
job->done_fence = dma_fence_get(&job->base.s_fence->finished);
@@ -460,9 +458,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
goto fail_sched;
mutex_init(&sched->mutex);
- spin_lock_init(&sched->job.list.lock);
- INIT_LIST_HEAD(&sched->job.list.head);
- init_waitqueue_head(&sched->job.wq);
+ spin_lock_init(&sched->job_list.lock);
+ INIT_LIST_HEAD(&sched->job_list.head);
return 0;
@@ -502,9 +499,6 @@ nouveau_sched_fini(struct nouveau_sched *sched)
struct drm_gpu_scheduler *drm_sched = &sched->base;
struct drm_sched_entity *entity = &sched->entity;
- rmb(); /* for list_empty to work without lock */
- wait_event(sched->job.wq, list_empty(&sched->job.list.head));
-
drm_sched_entity_fini(entity);
drm_sched_fini(drm_sched);
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index 20cd1da8db73..b98c3f0bef30 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -103,12 +103,9 @@ struct nouveau_sched {
struct mutex mutex;
struct {
- struct {
- struct list_head head;
- spinlock_t lock;
- } list;
- struct wait_queue_head wq;
- } job;
+ struct list_head head;
+ spinlock_t lock;
+ } job_list;
};
int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index 48f105239f42..ddfc46bc1b3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1019,8 +1019,8 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
u64 end = addr + range;
again:
- spin_lock(&sched->job.list.lock);
- list_for_each_entry(__job, &sched->job.list.head, entry) {
+ spin_lock(&sched->job_list.lock);
+ list_for_each_entry(__job, &sched->job_list.head, entry) {
struct nouveau_uvmm_bind_job *bind_job = to_uvmm_bind_job(__job);
list_for_each_op(op, &bind_job->ops) {
@@ -1030,7 +1030,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
if (!(end <= op_addr || addr >= op_end)) {
nouveau_uvmm_bind_job_get(bind_job);
- spin_unlock(&sched->job.list.lock);
+ spin_unlock(&sched->job_list.lock);
wait_for_completion(&bind_job->complete);
nouveau_uvmm_bind_job_put(bind_job);
goto again;
@@ -1038,7 +1038,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
}
}
}
- spin_unlock(&sched->job.list.lock);
+ spin_unlock(&sched->job_list.lock);
}
static int
--
2.49.0
* Re: [PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
From: Tvrtko Ursulin @ 2025-07-02 10:19 UTC
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On 01/07/2025 14:21, Philipp Stanner wrote:
> Since its inception, the GPU scheduler has been able to leak memory if
> the driver calls drm_sched_fini() while there are still jobs in flight.
>
> The simplest way to solve this in a backwards compatible manner is by
> adding a new callback, drm_sched_backend_ops.cancel_job(), which
> instructs the driver to signal the hardware fence associated with the
> job. Afterwards, the scheduler can safely use the established free_job()
> callback for freeing the job.
>
> Implement the new backend_ops callback cancel_job().
>
> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
> include/drm/gpu_scheduler.h | 18 ++++++++++++++
> 2 files changed, 39 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c63543132f9d..1239954f5f7c 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1353,6 +1353,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> }
> EXPORT_SYMBOL(drm_sched_init);
>
> +static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
> +{
> + struct drm_sched_job *job, *tmp;
> +
> + /* All other accessors are stopped. No locking necessary. */
> + list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
> + sched->ops->cancel_job(job);
> + list_del(&job->list);
list_del() is just for the warning in 3/6 I guess? You could in theory
zap the whole list in one go and avoid the safe iterator.
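An untested sketch of what I mean, popping each job off the list instead
of using the safe iterator (keeps the tail-first order and relies on the
same "all other accessors are stopped" guarantee as the patch):

	while (!list_empty(&sched->pending_list)) {
		struct drm_sched_job *job;

		job = list_last_entry(&sched->pending_list,
				      struct drm_sched_job, list);
		sched->ops->cancel_job(job);
		list_del(&job->list);
		sched->ops->free_job(job);
	}

Not that it matters really, so: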
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Regards,
Tvrtko
> + sched->ops->free_job(job);
> + }
> +}
> +
> /**
> * drm_sched_fini - Destroy a gpu scheduler
> *
> @@ -1360,19 +1372,11 @@ EXPORT_SYMBOL(drm_sched_init);
> *
> * Tears down and cleans up the scheduler.
> *
> - * This stops submission of new jobs to the hardware through
> - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
> - * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
> - * There is no solution for this currently. Thus, it is up to the driver to make
> - * sure that:
> - *
> - * a) drm_sched_fini() is only called after for all submitted jobs
> - * drm_sched_backend_ops.free_job() has been called or that
> - * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
> - * after drm_sched_fini() ran are freed manually.
> - *
> - * FIXME: Take care of the above problem and prevent this function from leaking
> - * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> + * This stops submission of new jobs to the hardware through &struct
> + * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
> + * is implemented, all jobs will be canceled through it and afterwards cleaned
> + * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
> + * implemented, memory could leak.
> */
> void drm_sched_fini(struct drm_gpu_scheduler *sched)
> {
> @@ -1402,6 +1406,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> /* Confirm no work left behind accessing device structures */
> cancel_delayed_work_sync(&sched->work_tdr);
>
> + /* Avoid memory leaks if supported by the driver. */
> + if (sched->ops->cancel_job)
> + drm_sched_cancel_remaining_jobs(sched);
> +
> if (sched->own_submit_wq)
> destroy_workqueue(sched->submit_wq);
> sched->ready = false;
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e62a7214e052..190844370f48 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -512,6 +512,24 @@ struct drm_sched_backend_ops {
> * and it's time to clean it up.
> */
> void (*free_job)(struct drm_sched_job *sched_job);
> +
> + /**
> + * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
> + * get signaled in drm_sched_fini().
> + *
> + * Used by the scheduler to cancel all jobs that have not been executed
> + * with &struct drm_sched_backend_ops.run_job by the time
> + * drm_sched_fini() gets invoked.
> + *
> + * Drivers need to signal the passed job's hardware fence with an
> + * appropriate error code (e.g., -ECANCELED) in this callback. They
> + * must not free the job.
> + *
> + * The scheduler will only call this callback once it stopped calling
> + * all other callbacks forever, with the exception of &struct
> + * drm_sched_backend_ops.free_job.
> + */
> + void (*cancel_job)(struct drm_sched_job *sched_job);
> };
>
> /**
* Re: [PATCH 2/6] drm/sched/tests: Port to cancel_job()
From: Tvrtko Ursulin @ 2025-07-02 10:36 UTC
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On 01/07/2025 14:21, Philipp Stanner wrote:
> The GPU Scheduler now supports a new callback, cancel_job(), which lets
> the scheduler cancel all jobs which might not yet be freed when
> drm_sched_fini() runs. Using this callback allows for significantly
> simplifying the mock scheduler teardown code.
>
> Implement the cancel_job() callback and adjust the code where necessary.
Cross referencing against my version I think you missed this hunk:
--- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
+++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
@@ -49,7 +49,6 @@ struct drm_mock_scheduler {
spinlock_t lock;
struct list_head job_list;
- struct list_head done_list;
struct {
u64 context;
I also had this:
@@ -97,7 +96,8 @@ struct drm_mock_sched_job {
struct completion done;
#define DRM_MOCK_SCHED_JOB_DONE 0x1
-#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
+#define DRM_MOCK_SCHED_JOB_CANCELED 0x2
+#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x4
And was setting it in the callback. And since we should add a test to
explicitly cover the new callback, and just the callback, that could
make it very easy to do it.
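Setting it would then be a one-liner in the new cancel_job() callback,
e.g. (name as in my hunk above):

	job->flags |= DRM_MOCK_SCHED_JOB_CANCELED;

right next to where the hw fence gets its -ECANCELED error.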
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> .../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
> 1 file changed, 23 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index 49d067fecd67..2d3169d95200 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> lockdep_assert_held(&sched->lock);
>
> job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> - list_move_tail(&job->link, &sched->done_list);
> + list_del(&job->link);
> dma_fence_signal_locked(&job->hw_fence);
> complete(&job->done);
> }
> @@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
>
> static void mock_sched_free_job(struct drm_sched_job *sched_job)
> {
> - struct drm_mock_scheduler *sched =
> - drm_sched_to_mock_sched(sched_job->sched);
> struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> - unsigned long flags;
>
> - /* Remove from the scheduler done list. */
> - spin_lock_irqsave(&sched->lock, flags);
> - list_del(&job->link);
> - spin_unlock_irqrestore(&sched->lock, flags);
> dma_fence_put(&job->hw_fence);
> -
> drm_sched_job_cleanup(sched_job);
>
> /* Mock job itself is freed by the kunit framework. */
> }
>
> +static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
> +{
> + struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
> + struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> + unsigned long flags;
> +
> + hrtimer_cancel(&job->timer);
> +
> + spin_lock_irqsave(&sched->lock, flags);
> + if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
> + list_del(&job->link);
> + dma_fence_set_error(&job->hw_fence, -ECANCELED);
> + dma_fence_signal_locked(&job->hw_fence);
> + }
> + spin_unlock_irqrestore(&sched->lock, flags);
> +
> + /* The GPU Scheduler will call drm_sched_backend_ops.free_job(), still.
> + * Mock job itself is freed by the kunit framework. */
/*
 * Multiline comment style to stay consistent, at least in this file.
 */
The rest looks good, but I need to revisit the timeout/free handling
since it has been a while and you changed it recently.
Regards,
Tvrtko
> +}
> +
> static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
> .run_job = mock_sched_run_job,
> .timedout_job = mock_sched_timedout_job,
> - .free_job = mock_sched_free_job
> + .free_job = mock_sched_free_job,
> + .cancel_job = mock_sched_cancel_job,
> };
>
> /**
> @@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> sched->hw_timeline.context = dma_fence_context_alloc(1);
> atomic_set(&sched->hw_timeline.next_seqno, 0);
> INIT_LIST_HEAD(&sched->job_list);
> - INIT_LIST_HEAD(&sched->done_list);
> spin_lock_init(&sched->lock);
>
> return sched;
> @@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> */
> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> {
> - struct drm_mock_sched_job *job, *next;
> - unsigned long flags;
> - LIST_HEAD(list);
> -
> - drm_sched_wqueue_stop(&sched->base);
> -
> - /* Force complete all unfinished jobs. */
> - spin_lock_irqsave(&sched->lock, flags);
> - list_for_each_entry_safe(job, next, &sched->job_list, link)
> - list_move_tail(&job->link, &list);
> - spin_unlock_irqrestore(&sched->lock, flags);
> -
> - list_for_each_entry(job, &list, link)
> - hrtimer_cancel(&job->timer);
> -
> - spin_lock_irqsave(&sched->lock, flags);
> - list_for_each_entry_safe(job, next, &list, link)
> - drm_mock_sched_job_complete(job);
> - spin_unlock_irqrestore(&sched->lock, flags);
> -
> - /*
> - * Free completed jobs and jobs not yet processed by the DRM scheduler
> - * free worker.
> - */
> - spin_lock_irqsave(&sched->lock, flags);
> - list_for_each_entry_safe(job, next, &sched->done_list, link)
> - list_move_tail(&job->link, &list);
> - spin_unlock_irqrestore(&sched->lock, flags);
> -
> - list_for_each_entry_safe(job, next, &list, link)
> - mock_sched_free_job(&job->base);
> -
> drm_sched_fini(&sched->base);
> }
>
* Re: [PATCH 2/6] drm/sched/tests: Port to cancel_job()
From: Philipp Stanner @ 2025-07-02 10:56 UTC
To: Tvrtko Ursulin, Philipp Stanner, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On Wed, 2025-07-02 at 11:36 +0100, Tvrtko Ursulin wrote:
>
> On 01/07/2025 14:21, Philipp Stanner wrote:
> > The GPU Scheduler now supports a new callback, cancel_job(), which
> > lets
> > the scheduler cancel all jobs which might not yet be freed when
> > drm_sched_fini() runs. Using this callback allows for significantly
> > simplifying the mock scheduler teardown code.
> >
> > Implement the cancel_job() callback and adjust the code where
> > necessary.
>
> Cross referencing against my version I think you missed this hunk:
>
> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> @@ -49,7 +49,6 @@ struct drm_mock_scheduler {
>
> spinlock_t lock;
> struct list_head job_list;
> - struct list_head done_list;
>
> struct {
> u64 context;
>
Right, overlooked that one.
>
> I also had this:
>
> @@ -97,7 +96,8 @@ struct drm_mock_sched_job {
> struct completion done;
>
> #define DRM_MOCK_SCHED_JOB_DONE 0x1
> -#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
> +#define DRM_MOCK_SCHED_JOB_CANCELED 0x2
> +#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x4
>
> And was setting it in the callback. And since we should add a test to
> explicitly cover the new callback, and just the callback, that could
> make it very easy to do it.
What do you imagine that to look like? The scheduler only invokes the
callback on tear down.
We also don't have tests that only test free_job() and the like, do
we?
You cannot test a callback for the scheduler, because the callback is
implemented in the driver.
Callbacks are tested by using the scheduler. In this case, it's tested
the intended way by the unit tests invoking drm_sched_fini().
P.
>
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> > .../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++-------
> > -----
> > 1 file changed, 23 insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > index 49d067fecd67..2d3169d95200 100644
> > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > @@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct
> > drm_mock_sched_job *job)
> > lockdep_assert_held(&sched->lock);
> >
> > job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> > - list_move_tail(&job->link, &sched->done_list);
> > + list_del(&job->link);
> > dma_fence_signal_locked(&job->hw_fence);
> > complete(&job->done);
> > }
> > @@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job
> > *sched_job)
> >
> > static void mock_sched_free_job(struct drm_sched_job *sched_job)
> > {
> > - struct drm_mock_scheduler *sched =
> > - drm_sched_to_mock_sched(sched_job->sched);
> > struct drm_mock_sched_job *job =
> > drm_sched_job_to_mock_job(sched_job);
> > - unsigned long flags;
> >
> > - /* Remove from the scheduler done list. */
> > - spin_lock_irqsave(&sched->lock, flags);
> > - list_del(&job->link);
> > - spin_unlock_irqrestore(&sched->lock, flags);
> > dma_fence_put(&job->hw_fence);
> > -
> > drm_sched_job_cleanup(sched_job);
> >
> > /* Mock job itself is freed by the kunit framework. */
> > }
> >
> > +static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
> > +{
> > + struct drm_mock_scheduler *sched =
> > drm_sched_to_mock_sched(sched_job->sched);
> > + struct drm_mock_sched_job *job =
> > drm_sched_job_to_mock_job(sched_job);
> > + unsigned long flags;
> > +
> > + hrtimer_cancel(&job->timer);
> > +
> > + spin_lock_irqsave(&sched->lock, flags);
> > + if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
> > + list_del(&job->link);
> > + dma_fence_set_error(&job->hw_fence, -ECANCELED);
> > + dma_fence_signal_locked(&job->hw_fence);
> > + }
> > + spin_unlock_irqrestore(&sched->lock, flags);
> > +
> > + /* The GPU Scheduler will call
> > drm_sched_backend_ops.free_job(), still.
> > + * Mock job itself is freed by the kunit framework. */
>
> /*
>  * Multiline comment style to stay consistent, at least in this file.
>  */
>
> The rest looks good, but I need to revisit the timeout/free handling
> since it has been a while and you changed it recently.
>
> Regards,
>
> Tvrtko
>
> > +}
> > +
> > static const struct drm_sched_backend_ops drm_mock_scheduler_ops
> > = {
> > .run_job = mock_sched_run_job,
> > .timedout_job = mock_sched_timedout_job,
> > - .free_job = mock_sched_free_job
> > + .free_job = mock_sched_free_job,
> > + .cancel_job = mock_sched_cancel_job,
> > };
> >
> > /**
> > @@ -289,7 +302,6 @@ struct drm_mock_scheduler
> > *drm_mock_sched_new(struct kunit *test, long timeout)
> > sched->hw_timeline.context = dma_fence_context_alloc(1);
> > atomic_set(&sched->hw_timeline.next_seqno, 0);
> > INIT_LIST_HEAD(&sched->job_list);
> > - INIT_LIST_HEAD(&sched->done_list);
> > spin_lock_init(&sched->lock);
> >
> > return sched;
> > @@ -304,38 +316,6 @@ struct drm_mock_scheduler
> > *drm_mock_sched_new(struct kunit *test, long timeout)
> > */
> > void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> > {
> > - struct drm_mock_sched_job *job, *next;
> > - unsigned long flags;
> > - LIST_HEAD(list);
> > -
> > - drm_sched_wqueue_stop(&sched->base);
> > -
> > - /* Force complete all unfinished jobs. */
> > - spin_lock_irqsave(&sched->lock, flags);
> > - list_for_each_entry_safe(job, next, &sched->job_list,
> > link)
> > - list_move_tail(&job->link, &list);
> > - spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > - list_for_each_entry(job, &list, link)
> > - hrtimer_cancel(&job->timer);
> > -
> > - spin_lock_irqsave(&sched->lock, flags);
> > - list_for_each_entry_safe(job, next, &list, link)
> > - drm_mock_sched_job_complete(job);
> > - spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > - /*
> > - * Free completed jobs and jobs not yet processed by the
> > DRM scheduler
> > - * free worker.
> > - */
> > - spin_lock_irqsave(&sched->lock, flags);
> > - list_for_each_entry_safe(job, next, &sched->done_list,
> > link)
> > - list_move_tail(&job->link, &list);
> > - spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > - list_for_each_entry_safe(job, next, &list, link)
> > - mock_sched_free_job(&job->base);
> > -
> > drm_sched_fini(&sched->base);
> > }
> >
>
* Re: [PATCH 2/6] drm/sched/tests: Port to cancel_job()
From: Tvrtko Ursulin @ 2025-07-02 11:25 UTC
To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Christian König, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On 02/07/2025 11:56, Philipp Stanner wrote:
> On Wed, 2025-07-02 at 11:36 +0100, Tvrtko Ursulin wrote:
>>
>> On 01/07/2025 14:21, Philipp Stanner wrote:
>>> The GPU Scheduler now supports a new callback, cancel_job(), which
>>> lets
>>> the scheduler cancel all jobs which might not yet be freed when
>>> drm_sched_fini() runs. Using this callback allows for significantly
>>> simplifying the mock scheduler teardown code.
>>>
>>> Implement the cancel_job() callback and adjust the code where
>>> necessary.
>>
>> Cross referencing against my version I think you missed this hunk:
>>
>> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
>> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>> @@ -49,7 +49,6 @@ struct drm_mock_scheduler {
>>
>> spinlock_t lock;
>> struct list_head job_list;
>> - struct list_head done_list;
>>
>> struct {
>> u64 context;
>>
>
> Right, overlooked that one.
>
>>
>> I also had this:
>>
>> @@ -97,7 +96,8 @@ struct drm_mock_sched_job {
>> struct completion done;
>>
>> #define DRM_MOCK_SCHED_JOB_DONE 0x1
>> -#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
>> +#define DRM_MOCK_SCHED_JOB_CANCELED 0x2
>> +#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x4
>>
>> And was setting it in the callback. And since we should add a test to
>> explicitly cover the new callback, and just the callback, that could
>> make it very easy to do it.
>
> What do you imagine that to look like? The scheduler only invokes the
> callback on tear down.
>
> We also don't have tests that only test free_job() and the like, do
> we?
>
> You cannot test a callback for the scheduler, because the callback is
> implemented in the driver.
>
> Callbacks are tested by using the scheduler. In this case, it's tested
> the intended way by the unit tests invoking drm_sched_fini().
Something like (untested):
static void drm_sched_test_cleanup(struct kunit *test)
{
	struct drm_mock_sched_entity *entity;
	struct drm_mock_scheduler *sched;
	struct drm_mock_sched_job *job;
	bool done;

	/*
	 * Check that the job cancel callback gets invoked by the scheduler.
	 */

	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
	entity = drm_mock_sched_entity_new(test,
					   DRM_SCHED_PRIORITY_NORMAL,
					   sched);
	job = drm_mock_sched_job_new(test, entity);

	drm_mock_sched_job_submit(job);

	done = drm_mock_sched_job_wait_scheduled(job, HZ);
	KUNIT_ASSERT_TRUE(test, done);

	drm_mock_sched_entity_free(entity);
	drm_mock_sched_fini(sched);

	KUNIT_ASSERT_TRUE(test, job->flags & DRM_MOCK_SCHED_JOB_CANCELED);
}
Or via the hw fence status.
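E.g. something like this (untested; the job memory is still valid after
drm_mock_sched_fini() since the kunit framework owns it):

	KUNIT_ASSERT_EQ(test,
			dma_fence_get_status(&job->hw_fence),
			-ECANCELED);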
Regards,
Tvrtko
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>> .../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++-------
>>> -----
>>> 1 file changed, 23 insertions(+), 43 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> index 49d067fecd67..2d3169d95200 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> @@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct
>>> drm_mock_sched_job *job)
>>> lockdep_assert_held(&sched->lock);
>>>
>>> job->flags |= DRM_MOCK_SCHED_JOB_DONE;
>>> - list_move_tail(&job->link, &sched->done_list);
>>> + list_del(&job->link);
>>> dma_fence_signal_locked(&job->hw_fence);
>>> complete(&job->done);
>>> }
>>> @@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job
>>> *sched_job)
>>>
>>> static void mock_sched_free_job(struct drm_sched_job *sched_job)
>>> {
>>> - struct drm_mock_scheduler *sched =
>>> - drm_sched_to_mock_sched(sched_job->sched);
>>> struct drm_mock_sched_job *job =
>>> drm_sched_job_to_mock_job(sched_job);
>>> - unsigned long flags;
>>>
>>> - /* Remove from the scheduler done list. */
>>> - spin_lock_irqsave(&sched->lock, flags);
>>> - list_del(&job->link);
>>> - spin_unlock_irqrestore(&sched->lock, flags);
>>> dma_fence_put(&job->hw_fence);
>>> -
>>> drm_sched_job_cleanup(sched_job);
>>>
>>> /* Mock job itself is freed by the kunit framework. */
>>> }
>>>
>>> +static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
>>> +{
>>> + struct drm_mock_scheduler *sched =
>>> drm_sched_to_mock_sched(sched_job->sched);
>>> + struct drm_mock_sched_job *job =
>>> drm_sched_job_to_mock_job(sched_job);
>>> + unsigned long flags;
>>> +
>>> + hrtimer_cancel(&job->timer);
>>> +
>>> + spin_lock_irqsave(&sched->lock, flags);
>>> + if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
>>> + list_del(&job->link);
>>> + dma_fence_set_error(&job->hw_fence, -ECANCELED);
>>> + dma_fence_signal_locked(&job->hw_fence);
>>> + }
>>> + spin_unlock_irqrestore(&sched->lock, flags);
>>> +
>>> + /* The GPU Scheduler will call
>>> drm_sched_backend_ops.free_job(), still.
>>> + * Mock job itself is freed by the kunit framework. */
>>
>> /*
>>  * Multiline comment style to stay consistent, at least in this file.
>>  */
>>
>> The rest looks good, but I need to revisit the timeout/free handling
>> since it has been a while and you changed it recently.
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +}
>>> +
>>> static const struct drm_sched_backend_ops drm_mock_scheduler_ops
>>> = {
>>> .run_job = mock_sched_run_job,
>>> .timedout_job = mock_sched_timedout_job,
>>> - .free_job = mock_sched_free_job
>>> + .free_job = mock_sched_free_job,
>>> + .cancel_job = mock_sched_cancel_job,
>>> };
>>>
>>> /**
>>> @@ -289,7 +302,6 @@ struct drm_mock_scheduler
>>> *drm_mock_sched_new(struct kunit *test, long timeout)
>>> sched->hw_timeline.context = dma_fence_context_alloc(1);
>>> atomic_set(&sched->hw_timeline.next_seqno, 0);
>>> INIT_LIST_HEAD(&sched->job_list);
>>> - INIT_LIST_HEAD(&sched->done_list);
>>> spin_lock_init(&sched->lock);
>>>
>>> return sched;
>>> @@ -304,38 +316,6 @@ struct drm_mock_scheduler
>>> *drm_mock_sched_new(struct kunit *test, long timeout)
>>> */
>>> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>>> {
>>> - struct drm_mock_sched_job *job, *next;
>>> - unsigned long flags;
>>> - LIST_HEAD(list);
>>> -
>>> - drm_sched_wqueue_stop(&sched->base);
>>> -
>>> - /* Force complete all unfinished jobs. */
>>> - spin_lock_irqsave(&sched->lock, flags);
>>> - list_for_each_entry_safe(job, next, &sched->job_list,
>>> link)
>>> - list_move_tail(&job->link, &list);
>>> - spin_unlock_irqrestore(&sched->lock, flags);
>>> -
>>> - list_for_each_entry(job, &list, link)
>>> - hrtimer_cancel(&job->timer);
>>> -
>>> - spin_lock_irqsave(&sched->lock, flags);
>>> - list_for_each_entry_safe(job, next, &list, link)
>>> - drm_mock_sched_job_complete(job);
>>> - spin_unlock_irqrestore(&sched->lock, flags);
>>> -
>>> - /*
>>> - * Free completed jobs and jobs not yet processed by the
>>> DRM scheduler
>>> - * free worker.
>>> - */
>>> - spin_lock_irqsave(&sched->lock, flags);
>>> - list_for_each_entry_safe(job, next, &sched->done_list,
>>> link)
>>> - list_move_tail(&job->link, &list);
>>> - spin_unlock_irqrestore(&sched->lock, flags);
>>> -
>>> - list_for_each_entry_safe(job, next, &list, link)
>>> - mock_sched_free_job(&job->base);
>>> -
>>> drm_sched_fini(&sched->base);
>>> }
>>>
>>
>
* Re: [PATCH 2/6] drm/sched/tests: Port to cancel_job()
From: Philipp Stanner @ 2025-07-04 9:53 UTC
To: Tvrtko Ursulin, phasta, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On Wed, 2025-07-02 at 12:25 +0100, Tvrtko Ursulin wrote:
>
> On 02/07/2025 11:56, Philipp Stanner wrote:
> > On Wed, 2025-07-02 at 11:36 +0100, Tvrtko Ursulin wrote:
> > >
> > > On 01/07/2025 14:21, Philipp Stanner wrote:
> > > > The GPU Scheduler now supports a new callback, cancel_job(),
> > > > which
> > > > lets
> > > > the scheduler cancel all jobs which might not yet be freed when
> > > > drm_sched_fini() runs. Using this callback allows for
> > > > significantly
> > > > simplifying the mock scheduler teardown code.
> > > >
> > > > Implement the cancel_job() callback and adjust the code where
> > > > necessary.
> > >
> > > Cross referencing against my version I think you missed this
> > > hunk:
> > >
> > > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > > @@ -49,7 +49,6 @@ struct drm_mock_scheduler {
> > >
> > > spinlock_t lock;
> > > struct list_head job_list;
> > > - struct list_head done_list;
> > >
> > > struct {
> > > u64 context;
> > >
> >
> > Right, overlooked that one.
> >
> > >
> > > I also had this:
> > >
> > > @@ -97,7 +96,8 @@ struct drm_mock_sched_job {
> > > struct completion done;
> > >
> > > #define DRM_MOCK_SCHED_JOB_DONE 0x1
> > > -#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
> > > +#define DRM_MOCK_SCHED_JOB_CANCELED 0x2
> > > +#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x4
> > >
> > > And was setting it in the callback. And since we should add a
> > > test to
> > > explicitly cover the new callback, and just the callback, that
> > > could
> > > make it very easy to do it.
> >
> > What do you imagine that to look like? The scheduler only invokes
> > the
> > callback on tear down.
> >
> > We also don't have tests that only test free_job() and the like, do
> > we?
> >
> > You cannot test a callback for the scheduler, because the callback
> > is
> > implemented in the driver.
> >
> > Callbacks are tested by using the scheduler. In this case, it's
> > tested
> > the intended way by the unit tests invoking drm_sched_fini().
>
> Something like (untested):
>
> static void drm_sched_test_cleanup(struct kunit *test)
> {
> struct drm_mock_sched_entity *entity;
> struct drm_mock_scheduler *sched;
> struct drm_mock_sched_job *job;
> bool done;
>
> /*
> * Check that the job cancel callback gets invoked by the
> scheduler.
> */
>
> sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> entity = drm_mock_sched_entity_new(test,
>
> DRM_SCHED_PRIORITY_NORMAL,
> sched);
>
> job = drm_mock_sched_job_new(test, entity);
> drm_mock_sched_job_submit(job);
> done = drm_mock_sched_job_wait_scheduled(job, HZ);
> KUNIT_ASSERT_TRUE(test, done);
>
> drm_mock_sched_entity_free(entity);
> drm_mock_sched_fini(sched);
>
> KUNIT_ASSERT_TRUE(test, job->flags &
> DRM_MOCK_SCHED_JOB_CANCELED);
> }
That could work – but it's racy.
I wonder if we want to introduce a mechanism into the mock scheduler
through which we can enforce a simulated GPU hang. Then it would never
race, and that might be useful for more advanced tests for reset
recovery and those things.
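
Something like this, perhaps (completely untested sketch; the flag and the
helper below are invented names, nothing like this exists in the mock
scheduler yet):

        /* New flag for struct drm_mock_sched_job, next to JOB_DONE etc. */
        #define DRM_MOCK_SCHED_JOB_HANG 0x8

        /* The test would call this before submitting the job. */
        static void drm_mock_sched_job_set_hang(struct drm_mock_sched_job *job)
        {
                job->flags |= DRM_MOCK_SCHED_JOB_HANG;
        }

mock_sched_run_job() would then simply never arm the completion timer for
such a job, so its hardware fence could only ever get signaled by a timeout
or by cancel_job() during drm_mock_sched_fini().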
Opinions?
P.
>
> Or via the hw fence status.
>
> Regards,
>
> Tvrtko
>
> > > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > > ---
> > > >   .../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
> > > >   1 file changed, 23 insertions(+), 43 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > index 49d067fecd67..2d3169d95200 100644
> > > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > @@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> > > >  	lockdep_assert_held(&sched->lock);
> > > >  
> > > >  	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> > > > -	list_move_tail(&job->link, &sched->done_list);
> > > > +	list_del(&job->link);
> > > >  	dma_fence_signal_locked(&job->hw_fence);
> > > >  	complete(&job->done);
> > > >  }
> > > > @@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
> > > >  
> > > >  static void mock_sched_free_job(struct drm_sched_job *sched_job)
> > > >  {
> > > > -	struct drm_mock_scheduler *sched =
> > > > -			drm_sched_to_mock_sched(sched_job->sched);
> > > >  	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> > > > -	unsigned long flags;
> > > >  
> > > > -	/* Remove from the scheduler done list. */
> > > > -	spin_lock_irqsave(&sched->lock, flags);
> > > > -	list_del(&job->link);
> > > > -	spin_unlock_irqrestore(&sched->lock, flags);
> > > >  	dma_fence_put(&job->hw_fence);
> > > > -
> > > >  	drm_sched_job_cleanup(sched_job);
> > > >  
> > > >  	/* Mock job itself is freed by the kunit framework. */
> > > >  }
> > > >  
> > > > +static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
> > > > +{
> > > > +	struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
> > > > +	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> > > > +	unsigned long flags;
> > > > +
> > > > +	hrtimer_cancel(&job->timer);
> > > > +
> > > > +	spin_lock_irqsave(&sched->lock, flags);
> > > > +	if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
> > > > +		list_del(&job->link);
> > > > +		dma_fence_set_error(&job->hw_fence, -ECANCELED);
> > > > +		dma_fence_signal_locked(&job->hw_fence);
> > > > +	}
> > > > +	spin_unlock_irqrestore(&sched->lock, flags);
> > > > +
> > > > +	/* The GPU Scheduler will call drm_sched_backend_ops.free_job(), still.
> > > > +	 * Mock job itself is freed by the kunit framework. */
> > >
> > > /*
> > >  * Multiline comment style to stay consistent, at least in this file.
> > >  */
> > >
> > > The rest looks good, but I need to revisit the timeout/free handling
> > > since it has been a while and you changed it recently.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
> > > > +}
> > > > +
> > > >  static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
> > > >  	.run_job = mock_sched_run_job,
> > > >  	.timedout_job = mock_sched_timedout_job,
> > > > -	.free_job = mock_sched_free_job
> > > > +	.free_job = mock_sched_free_job,
> > > > +	.cancel_job = mock_sched_cancel_job,
> > > >  };
> > > >  
> > > >  /**
> > > > @@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> > > >  	sched->hw_timeline.context = dma_fence_context_alloc(1);
> > > >  	atomic_set(&sched->hw_timeline.next_seqno, 0);
> > > >  	INIT_LIST_HEAD(&sched->job_list);
> > > > -	INIT_LIST_HEAD(&sched->done_list);
> > > >  	spin_lock_init(&sched->lock);
> > > >  
> > > >  	return sched;
> > > > @@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> > > >   */
> > > >  void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> > > >  {
> > > > -	struct drm_mock_sched_job *job, *next;
> > > > -	unsigned long flags;
> > > > -	LIST_HEAD(list);
> > > > -
> > > > -	drm_sched_wqueue_stop(&sched->base);
> > > > -
> > > > -	/* Force complete all unfinished jobs. */
> > > > -	spin_lock_irqsave(&sched->lock, flags);
> > > > -	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > > > -		list_move_tail(&job->link, &list);
> > > > -	spin_unlock_irqrestore(&sched->lock, flags);
> > > > -
> > > > -	list_for_each_entry(job, &list, link)
> > > > -		hrtimer_cancel(&job->timer);
> > > > -
> > > > -	spin_lock_irqsave(&sched->lock, flags);
> > > > -	list_for_each_entry_safe(job, next, &list, link)
> > > > -		drm_mock_sched_job_complete(job);
> > > > -	spin_unlock_irqrestore(&sched->lock, flags);
> > > > -
> > > > -	/*
> > > > -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
> > > > -	 * free worker.
> > > > -	 */
> > > > -	spin_lock_irqsave(&sched->lock, flags);
> > > > -	list_for_each_entry_safe(job, next, &sched->done_list, link)
> > > > -		list_move_tail(&job->link, &list);
> > > > -	spin_unlock_irqrestore(&sched->lock, flags);
> > > > -
> > > > -	list_for_each_entry_safe(job, next, &list, link)
> > > > -		mock_sched_free_job(&job->base);
> > > > -
> > > >  	drm_sched_fini(&sched->base);
> > > >  }
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/6] drm/sched/tests: Port to cancel_job()
2025-07-04 9:53 ` Philipp Stanner
@ 2025-07-04 11:27 ` Philipp Stanner
2025-07-04 12:30 ` Tvrtko Ursulin
0 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-04 11:27 UTC (permalink / raw)
To: phasta, Tvrtko Ursulin, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On Fri, 2025-07-04 at 11:53 +0200, Philipp Stanner wrote:
> On Wed, 2025-07-02 at 12:25 +0100, Tvrtko Ursulin wrote:
> >
> > On 02/07/2025 11:56, Philipp Stanner wrote:
> > > On Wed, 2025-07-02 at 11:36 +0100, Tvrtko Ursulin wrote:
> > > >
> > > > On 01/07/2025 14:21, Philipp Stanner wrote:
> > > > > The GPU Scheduler now supports a new callback, cancel_job(), which
> > > > > lets the scheduler cancel all jobs which might not yet be freed when
> > > > > drm_sched_fini() runs. Using this callback allows for significantly
> > > > > simplifying the mock scheduler teardown code.
> > > > >
> > > > > Implement the cancel_job() callback and adjust the code where
> > > > > necessary.
> > > >
> > > > Cross referencing against my version I think you missed this hunk:
> > > >
> > > > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > > > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > > > @@ -49,7 +49,6 @@ struct drm_mock_scheduler {
> > > >
> > > > spinlock_t lock;
> > > > struct list_head job_list;
> > > > - struct list_head done_list;
> > > >
> > > > struct {
> > > > u64 context;
> > > >
> > >
> > > Right, overlooked that one.
> > >
> > > >
> > > > I also had this:
> > > >
> > > > @@ -97,7 +96,8 @@ struct drm_mock_sched_job {
> > > > struct completion done;
> > > >
> > > > #define DRM_MOCK_SCHED_JOB_DONE 0x1
> > > > -#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
> > > > +#define DRM_MOCK_SCHED_JOB_CANCELED 0x2
> > > > +#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x4
> > > >
> > > > And was setting it in the callback. And since we should add a test to
> > > > explicitly cover the new callback, and just the callback, that could
> > > > make it very easy to do it.
> > >
> > > What do you imagine that would look like? The scheduler only invokes the
> > > callback on teardown.
> > >
> > > We also don't have tests that only test free_job() and the like, do we?
> > >
> > > You cannot test a callback for the scheduler, because the callback is
> > > implemented in the driver.
> > >
> > > Callbacks are tested by using the scheduler. In this case, it's tested
> > > the intended way by the unit tests invoking drm_sched_fini().
> >
> > Something like (untested):
> >
> > static void drm_sched_test_cleanup(struct kunit *test)
> > {
> >         struct drm_mock_sched_entity *entity;
> >         struct drm_mock_scheduler *sched;
> >         struct drm_mock_sched_job *job;
> >         bool done;
> >
> >         /*
> >          * Check that the job cancel callback gets invoked by the scheduler.
> >          */
> >
> >         sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> >         entity = drm_mock_sched_entity_new(test,
> >                                            DRM_SCHED_PRIORITY_NORMAL,
> >                                            sched);
> >
> >         job = drm_mock_sched_job_new(test, entity);
> >         drm_mock_sched_job_submit(job);
> >         done = drm_mock_sched_job_wait_scheduled(job, HZ);
> >         KUNIT_ASSERT_TRUE(test, done);
> >
> >         drm_mock_sched_entity_free(entity);
> >         drm_mock_sched_fini(sched);
> >
> >         KUNIT_ASSERT_TRUE(test, job->flags & DRM_MOCK_SCHED_JOB_CANCELED);
> > }
>
> That could work – but it's racy.
>
> I wonder if we want to introduce a mechanism into the mock scheduler
> through which we can enforce a simulated GPU hang. Then it would never
> race, and that might be useful for more advanced tests for reset
> recovery and those things.
>
> Opinions?
Ah wait, we can do that with job_duration = 0
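
I.e. your test from above, just never setting a job duration, so the job is
guaranteed to still sit on the pending list when drm_mock_sched_fini() runs.
Roughly (untested; the DRM_MOCK_SCHED_JOB_CANCELED flag being set in the
callback as you proposed, the rest is straight from the patch):

        /* In mock_sched_cancel_job(), under sched->lock: */
        if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
                list_del(&job->link);
                dma_fence_set_error(&job->hw_fence, -ECANCELED);
                dma_fence_signal_locked(&job->hw_fence);
                job->flags |= DRM_MOCK_SCHED_JOB_CANCELED;
        }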
Very good, I'll include something like that in v2.
P.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
2025-07-01 13:21 ` [PATCH 1/6] drm/sched: Avoid " Philipp Stanner
2025-07-02 10:19 ` Tvrtko Ursulin
@ 2025-07-04 12:18 ` Maíra Canal
1 sibling, 0 replies; 15+ messages in thread
From: Maíra Canal @ 2025-07-04 12:18 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
Hi Philipp,
On 01/07/25 10:21, Philipp Stanner wrote:
> Since its inception, the GPU scheduler can leak memory if the driver
> calls drm_sched_fini() while there are still jobs in flight.
>
> The simplest way to solve this in a backwards compatible manner is by
> adding a new callback, drm_sched_backend_ops.cancel_job(), which
> instructs the driver to signal the hardware fence associated with the
> job. Afterwards, the scheduler can savely use the established free_job()
s/savely/safely
> callback for freeing the job.
>
> Implement the new backend_ops callback cancel_job().
>
> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Maíra Canal <mcanal@igalia.com>
Best Regards,
- Maíra
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/6] drm/sched/tests: Port to cancel_job()
2025-07-04 11:27 ` Philipp Stanner
@ 2025-07-04 12:30 ` Tvrtko Ursulin
0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2025-07-04 12:30 UTC (permalink / raw)
To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Christian König, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, linaro-mm-sig
On 04/07/2025 12:27, Philipp Stanner wrote:
> On Fri, 2025-07-04 at 11:53 +0200, Philipp Stanner wrote:
>> On Wed, 2025-07-02 at 12:25 +0100, Tvrtko Ursulin wrote:
>>>
>>> On 02/07/2025 11:56, Philipp Stanner wrote:
>>>> On Wed, 2025-07-02 at 11:36 +0100, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 01/07/2025 14:21, Philipp Stanner wrote:
>>>>>> The GPU Scheduler now supports a new callback, cancel_job(), which
>>>>>> lets the scheduler cancel all jobs which might not yet be freed when
>>>>>> drm_sched_fini() runs. Using this callback allows for significantly
>>>>>> simplifying the mock scheduler teardown code.
>>>>>>
>>>>>> Implement the cancel_job() callback and adjust the code where
>>>>>> necessary.
>>>>>
>>>>> Cross referencing against my version I think you missed this hunk:
>>>>>
>>>>> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>>> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>>> @@ -49,7 +49,6 @@ struct drm_mock_scheduler {
>>>>>
>>>>> spinlock_t lock;
>>>>> struct list_head job_list;
>>>>> - struct list_head done_list;
>>>>>
>>>>> struct {
>>>>> u64 context;
>>>>>
>>>>
>>>> Right, overlooked that one.
>>>>
>>>>>
>>>>> I also had this:
>>>>>
>>>>> @@ -97,7 +96,8 @@ struct drm_mock_sched_job {
>>>>> struct completion done;
>>>>>
>>>>> #define DRM_MOCK_SCHED_JOB_DONE 0x1
>>>>> -#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
>>>>> +#define DRM_MOCK_SCHED_JOB_CANCELED 0x2
>>>>> +#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x4
>>>>>
>>>>> And was setting it in the callback. And since we should add a test to
>>>>> explicitly cover the new callback, and just the callback, that could
>>>>> make it very easy to do it.
>>>>
>>>> What do you imagine that would look like? The scheduler only invokes the
>>>> callback on teardown.
>>>>
>>>> We also don't have tests that only test free_job() and the like, do we?
>>>>
>>>> You cannot test a callback for the scheduler, because the callback is
>>>> implemented in the driver.
>>>>
>>>> Callbacks are tested by using the scheduler. In this case, it's tested
>>>> the intended way by the unit tests invoking drm_sched_fini().
>>>
>>> Something like (untested):
>>>
>>> static void drm_sched_test_cleanup(struct kunit *test)
>>> {
>>>         struct drm_mock_sched_entity *entity;
>>>         struct drm_mock_scheduler *sched;
>>>         struct drm_mock_sched_job *job;
>>>         bool done;
>>>
>>>         /*
>>>          * Check that the job cancel callback gets invoked by the scheduler.
>>>          */
>>>
>>>         sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
>>>         entity = drm_mock_sched_entity_new(test,
>>>                                            DRM_SCHED_PRIORITY_NORMAL,
>>>                                            sched);
>>>
>>>         job = drm_mock_sched_job_new(test, entity);
>>>         drm_mock_sched_job_submit(job);
>>>         done = drm_mock_sched_job_wait_scheduled(job, HZ);
>>>         KUNIT_ASSERT_TRUE(test, done);
>>>
>>>         drm_mock_sched_entity_free(entity);
>>>         drm_mock_sched_fini(sched);
>>>
>>>         KUNIT_ASSERT_TRUE(test, job->flags & DRM_MOCK_SCHED_JOB_CANCELED);
>>> }
>>
>> That could work – but it's racy.
>>
>> I wonder if we want to introduce a mechanism into the mock scheduler
>> through which we can enforce a simulated GPU hang. Then it would never
>> race, and that might be useful for more advanced tests for reset
>> recovery and those things.
>>
>> Opinions?
>
> Ah wait, we can do that with job_duration = 0
Yes, the above already wasn't racy.

If the job duration is not explicitly set, and it isn't, the job will not
complete until the test calls drm_mock_sched_advance(sched, 1). And since
it doesn't, the job is guaranteed to be sitting on the list forever, so
drm_sched_fini() is guaranteed to get to clean it up.
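
So the assert on job->flags at the end is reliable as-is. Checking via the
hw fence status would work just as well, since cancel_job() signals the
fence with an error (untested sketch):

        KUNIT_ASSERT_EQ(test, dma_fence_get_status(&job->hw_fence),
                        -ECANCELED);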
> Very good, I'll include something like that in v2.
Ack.
Regards,
Tvrtko
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-07-04 12:30 UTC | newest]
Thread overview: 15+ messages
2025-07-01 13:21 [PATCH 0/6] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
2025-07-01 13:21 ` [PATCH 1/6] drm/sched: Avoid " Philipp Stanner
2025-07-02 10:19 ` Tvrtko Ursulin
2025-07-04 12:18 ` Maíra Canal
2025-07-01 13:21 ` [PATCH 2/6] drm/sched/tests: Port to cancel_job() Philipp Stanner
2025-07-02 10:36 ` Tvrtko Ursulin
2025-07-02 10:56 ` Philipp Stanner
2025-07-02 11:25 ` Tvrtko Ursulin
2025-07-04 9:53 ` Philipp Stanner
2025-07-04 11:27 ` Philipp Stanner
2025-07-04 12:30 ` Tvrtko Ursulin
2025-07-01 13:21 ` [PATCH 3/6] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-07-01 13:21 ` [PATCH 4/6] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
2025-07-01 13:21 ` [PATCH 5/6] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-07-01 13:21 ` [PATCH 6/6] drm/nouveau: Remove waitque for sched teardown Philipp Stanner