linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback
@ 2025-07-07 13:42 Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

Changes in v2:
  - Add new unit test to test cancel_job()'s behavior. (Tvrtko)
  - Add RB from Maíra

Changes since the RFC:
  - Rename helper function for drm_sched_fini(). (Tvrtko)
  - Add Link to Tvrtko's RFC patch to patch 1.


Since a long time, drm_sched_fini() can leak jobs that are still in
drm_sched.pending_list.

This series solves the leaks in a backwards-compatible manner by adding
a new, optional callback. If that callback is implemented, the scheduler
uses it to cancel all jobs from pending_list and then frees them.

Philipp Stanner (7):
  drm/sched: Avoid memory leaks with cancel_job() callback
  drm/sched/tests: Implement cancel_job() callback
  drm/sched/tests: Add unit test for cancel_job()
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Make fence container helper usable driver-wide
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown

 drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 ++++++----
 drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
 drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 ++++++----
 drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
 drivers/gpu/drm/scheduler/sched_main.c        | 37 +++++++----
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 66 +++++++------------
 drivers/gpu/drm/scheduler/tests/tests_basic.c | 43 ++++++++++++
 include/drm/gpu_scheduler.h                   | 18 +++++
 9 files changed, 165 insertions(+), 93 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/7] drm/sched: Avoid memory leaks with cancel_job() callback
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media, Maíra Canal

Since its inception, the GPU scheduler can leak memory if the driver
calls drm_sched_fini() while there are still jobs in flight.

The simplest way to solve this in a backwards compatible manner is by
adding a new callback, drm_sched_backend_ops.cancel_job(), which
instructs the driver to signal the hardware fence associated with the
job. Afterwards, the scheduler can safely use the established free_job()
callback for freeing the job.

Implement the new backend_ops callback cancel_job().

Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Maíra Canal <mcanal@igalia.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
 include/drm/gpu_scheduler.h            | 18 ++++++++++++++
 2 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c63543132f9d..1239954f5f7c 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1353,6 +1353,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 }
 EXPORT_SYMBOL(drm_sched_init);
 
+static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_job *job, *tmp;
+
+	/* All other accessors are stopped. No locking necessary. */
+	list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
+		sched->ops->cancel_job(job);
+		list_del(&job->list);
+		sched->ops->free_job(job);
+	}
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
@@ -1360,19 +1372,11 @@ EXPORT_SYMBOL(drm_sched_init);
  *
  * Tears down and cleans up the scheduler.
  *
- * This stops submission of new jobs to the hardware through
- * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
- * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
- * There is no solution for this currently. Thus, it is up to the driver to make
- * sure that:
- *
- *  a) drm_sched_fini() is only called after for all submitted jobs
- *     drm_sched_backend_ops.free_job() has been called or that
- *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
- *     after drm_sched_fini() ran are freed manually.
- *
- * FIXME: Take care of the above problem and prevent this function from leaking
- * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
+ * This stops submission of new jobs to the hardware through &struct
+ * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
+ * is implemented, all jobs will be canceled through it and afterwards cleaned
+ * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
+ * implemented, memory could leak.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
@@ -1402,6 +1406,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	/* Confirm no work left behind accessing device structures */
 	cancel_delayed_work_sync(&sched->work_tdr);
 
+	/* Avoid memory leaks if supported by the driver. */
+	if (sched->ops->cancel_job)
+		drm_sched_cancel_remaining_jobs(sched);
+
 	if (sched->own_submit_wq)
 		destroy_workqueue(sched->submit_wq);
 	sched->ready = false;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e62a7214e052..190844370f48 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -512,6 +512,24 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
+	 * get signaled in drm_sched_fini().
+	 *
+	 * Used by the scheduler to cancel all jobs that have not been executed
+	 * with &struct drm_sched_backend_ops.run_job by the time
+	 * drm_sched_fini() gets invoked.
+	 *
+	 * Drivers need to signal the passed job's hardware fence with an
+	 * appropriate error code (e.g., -ECANCELED) in this callback. They
+	 * must not free the job.
+	 *
+	 * The scheduler will only call this callback once it stopped calling
+	 * all other callbacks forever, with the exception of &struct
+	 * drm_sched_backend_ops.free_job.
+	 */
+	void (*cancel_job)(struct drm_sched_job *sched_job);
 };
 
 /**
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/7] drm/sched/tests: Implement cancel_job() callback
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 15:25   ` Tvrtko Ursulin
  2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

The GPU Scheduler now supports a new callback, cancel_job(), which lets
the scheduler cancel all jobs which might not yet be freed when
drm_sched_fini() runs. Using this callback allows for significantly
simplifying the mock scheduler teardown code.

Implement the cancel_job() callback and adjust the code where necessary.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 66 +++++++------------
 1 file changed, 23 insertions(+), 43 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index 49d067fecd67..2d3169d95200 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
 	lockdep_assert_held(&sched->lock);
 
 	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
-	list_move_tail(&job->link, &sched->done_list);
+	list_del(&job->link);
 	dma_fence_signal_locked(&job->hw_fence);
 	complete(&job->done);
 }
@@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
 
 static void mock_sched_free_job(struct drm_sched_job *sched_job)
 {
-	struct drm_mock_scheduler *sched =
-			drm_sched_to_mock_sched(sched_job->sched);
 	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
-	unsigned long flags;
 
-	/* Remove from the scheduler done list. */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_del(&job->link);
-	spin_unlock_irqrestore(&sched->lock, flags);
 	dma_fence_put(&job->hw_fence);
-
 	drm_sched_job_cleanup(sched_job);
 
 	/* Mock job itself is freed by the kunit framework. */
 }
 
+static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+	struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
+	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
+	unsigned long flags;
+
+	hrtimer_cancel(&job->timer);
+
+	spin_lock_irqsave(&sched->lock, flags);
+	if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
+		list_del(&job->link);
+		dma_fence_set_error(&job->hw_fence, -ECANCELED);
+		dma_fence_signal_locked(&job->hw_fence);
+	}
+	spin_unlock_irqrestore(&sched->lock, flags);
+
+	/* The GPU Scheduler will call drm_sched_backend_ops.free_job(), still.
+	 * Mock job itself is freed by the kunit framework. */
+}
+
 static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
 	.run_job = mock_sched_run_job,
 	.timedout_job = mock_sched_timedout_job,
-	.free_job = mock_sched_free_job
+	.free_job = mock_sched_free_job,
+	.cancel_job = mock_sched_cancel_job,
 };
 
 /**
@@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
 	sched->hw_timeline.context = dma_fence_context_alloc(1);
 	atomic_set(&sched->hw_timeline.next_seqno, 0);
 	INIT_LIST_HEAD(&sched->job_list);
-	INIT_LIST_HEAD(&sched->done_list);
 	spin_lock_init(&sched->lock);
 
 	return sched;
@@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
  */
 void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
 {
-	struct drm_mock_sched_job *job, *next;
-	unsigned long flags;
-	LIST_HEAD(list);
-
-	drm_sched_wqueue_stop(&sched->base);
-
-	/* Force complete all unfinished jobs. */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &sched->job_list, link)
-		list_move_tail(&job->link, &list);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	list_for_each_entry(job, &list, link)
-		hrtimer_cancel(&job->timer);
-
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &list, link)
-		drm_mock_sched_job_complete(job);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	/*
-	 * Free completed jobs and jobs not yet processed by the DRM scheduler
-	 * free worker.
-	 */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &sched->done_list, link)
-		list_move_tail(&job->link, &list);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	list_for_each_entry_safe(job, next, &list, link)
-		mock_sched_free_job(&job->base);
-
 	drm_sched_fini(&sched->base);
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job()
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 15:20   ` Tvrtko Ursulin
  2025-07-07 13:42 ` [PATCH v2 4/7] drm/sched: Warn if pending list is not empty Philipp Stanner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

The scheduler unit tests now provide a new callback, cancel_job(). This
callback gets used by drm_sched_fini() for all still pending jobs to
cancel them.

Implement a new unit test to test this.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/tests/tests_basic.c | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 7230057e0594..fa3da2db4893 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -204,6 +204,48 @@ static struct kunit_suite drm_sched_basic = {
 	.test_cases = drm_sched_basic_tests,
 };
 
+static void drm_sched_basic_cancel(struct kunit *test)
+{
+	struct drm_mock_sched_entity *entity;
+	struct drm_mock_scheduler *sched;
+	struct drm_mock_sched_job *job;
+	bool done;
+
+	/*
+	 * Check that the configured credit limit is respected.
+	 */
+
+	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
+	sched->base.credit_limit = 1;
+
+	entity = drm_mock_sched_entity_new(test, DRM_SCHED_PRIORITY_NORMAL,
+					   sched);
+
+	job = drm_mock_sched_job_new(test, entity);
+
+	drm_mock_sched_job_submit(job);
+
+	done = drm_mock_sched_job_wait_scheduled(job, HZ);
+	KUNIT_ASSERT_TRUE(test, done);
+
+	drm_mock_sched_entity_free(entity);
+	drm_mock_sched_fini(sched);
+
+	KUNIT_ASSERT_EQ(test, job->hw_fence.error, -ECANCELED);
+}
+
+static struct kunit_case drm_sched_cancel_tests[] = {
+	KUNIT_CASE(drm_sched_basic_cancel),
+	{}
+};
+
+static struct kunit_suite drm_sched_cancel = {
+	.name = "drm_sched_basic_cancel_tests",
+	.init = drm_sched_basic_init,
+	.exit = drm_sched_basic_exit,
+	.test_cases = drm_sched_cancel_tests,
+};
+
 static void drm_sched_basic_timeout(struct kunit *test)
 {
 	struct drm_mock_scheduler *sched = test->priv;
@@ -471,6 +513,7 @@ static struct kunit_suite drm_sched_credits = {
 
 kunit_test_suites(&drm_sched_basic,
 		  &drm_sched_timeout,
+		  &drm_sched_cancel,
 		  &drm_sched_priority,
 		  &drm_sched_modify_sched,
 		  &drm_sched_credits);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/7] drm/sched: Warn if pending list is not empty
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
                   ` (2 preceding siblings ...)
  2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

drm_sched_fini() can leak jobs under certain circumstances.

Warn if that happens.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1239954f5f7c..dadf1a22ddf6 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1415,6 +1415,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	sched->ready = false;
 	kfree(sched->sched_rq);
 	sched->sched_rq = NULL;
+
+	if (!list_empty(&sched->pending_list))
+		dev_err(sched->dev, "Tearing down scheduler while jobs are pending!\n");
 }
 EXPORT_SYMBOL(drm_sched_fini);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
                   ` (3 preceding siblings ...)
  2025-07-07 13:42 ` [PATCH v2 4/7] drm/sched: Warn if pending list is not empty Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 13:45   ` Danilo Krummrich
  2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
  2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
  6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

In order to implement a new DRM GPU scheduler callback in Nouveau, a
helper for obtaining a nouveau_fence from a dma_fence is necessary. Such
a helper exists already inside nouveau_fence.c, called from_fence().

Make that helper available to other C files with a more precise name.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++++++-------------
 drivers/gpu/drm/nouveau/nouveau_fence.h |  6 ++++++
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index d5654e26d5bc..869d4335c0f4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -38,12 +38,6 @@
 static const struct dma_fence_ops nouveau_fence_ops_uevent;
 static const struct dma_fence_ops nouveau_fence_ops_legacy;
 
-static inline struct nouveau_fence *
-from_fence(struct dma_fence *fence)
-{
-	return container_of(fence, struct nouveau_fence, base);
-}
-
 static inline struct nouveau_fence_chan *
 nouveau_fctx(struct nouveau_fence *fence)
 {
@@ -77,7 +71,7 @@ nouveau_local_fence(struct dma_fence *fence, struct nouveau_drm *drm)
 	    fence->ops != &nouveau_fence_ops_uevent)
 		return NULL;
 
-	return from_fence(fence);
+	return to_nouveau_fence(fence);
 }
 
 void
@@ -268,7 +262,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
 static long
 nouveau_fence_wait_legacy(struct dma_fence *f, bool intr, long wait)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	unsigned long sleep_time = NSEC_PER_MSEC / 1000;
 	unsigned long t = jiffies, timeout = t + wait;
 
@@ -448,7 +442,7 @@ static const char *nouveau_fence_get_get_driver_name(struct dma_fence *fence)
 
 static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 
 	return !fctx->dead ? fctx->name : "dead channel";
@@ -462,7 +456,7 @@ static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
  */
 static bool nouveau_fence_is_signaled(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 	struct nouveau_channel *chan;
 	bool ret = false;
@@ -478,7 +472,7 @@ static bool nouveau_fence_is_signaled(struct dma_fence *f)
 
 static bool nouveau_fence_no_signaling(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 
 	/*
 	 * caller should have a reference on the fence,
@@ -503,7 +497,7 @@ static bool nouveau_fence_no_signaling(struct dma_fence *f)
 
 static void nouveau_fence_release(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 
 	kref_put(&fctx->fence_ref, nouveau_fence_context_put);
@@ -521,7 +515,7 @@ static const struct dma_fence_ops nouveau_fence_ops_legacy = {
 
 static bool nouveau_fence_enable_signaling(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 	bool ret;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 6a983dd9f7b9..183dd43ecfff 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -17,6 +17,12 @@ struct nouveau_fence {
 	unsigned long timeout;
 };
 
+static inline struct nouveau_fence *
+to_nouveau_fence(struct dma_fence *fence)
+{
+	return container_of(fence, struct nouveau_fence, base);
+}
+
 int  nouveau_fence_create(struct nouveau_fence **, struct nouveau_channel *);
 int  nouveau_fence_new(struct nouveau_fence **, struct nouveau_channel *);
 void nouveau_fence_unref(struct nouveau_fence **);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
                   ` (4 preceding siblings ...)
  2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 14:26   ` Danilo Krummrich
  2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
  6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

There is a new callback for always tearing the scheduler down in a
leak-free, deadlock-free manner.

Port Nouveau as its first user by providing the scheduler with a
callback that ensures the fence context gets killed in drm_sched_fini().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_fence.c | 15 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_fence.h |  1 +
 drivers/gpu/drm/nouveau/nouveau_sched.c | 15 ++++++++++++++-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 869d4335c0f4..9f345a008717 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -240,6 +240,21 @@ nouveau_fence_emit(struct nouveau_fence *fence)
 	return ret;
 }
 
+void
+nouveau_fence_cancel(struct nouveau_fence *fence)
+{
+	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
+	unsigned long flags;
+
+	spin_lock_irqsave(&fctx->lock, flags);
+	if (!dma_fence_is_signaled_locked(&fence->base)) {
+		dma_fence_set_error(&fence->base, -ECANCELED);
+		if (nouveau_fence_signal(fence))
+			nvif_event_block(&fctx->event);
+	}
+	spin_unlock_irqrestore(&fctx->lock, flags);
+}
+
 bool
 nouveau_fence_done(struct nouveau_fence *fence)
 {
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 183dd43ecfff..9957a919bd38 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -29,6 +29,7 @@ void nouveau_fence_unref(struct nouveau_fence **);
 
 int  nouveau_fence_emit(struct nouveau_fence *);
 bool nouveau_fence_done(struct nouveau_fence *);
+void nouveau_fence_cancel(struct nouveau_fence *fence);
 int  nouveau_fence_wait(struct nouveau_fence *, bool lazy, bool intr);
 int  nouveau_fence_sync(struct nouveau_bo *, struct nouveau_channel *, bool exclusive, bool intr);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 460a5fb02412..2ec62059c351 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -11,6 +11,7 @@
 #include "nouveau_exec.h"
 #include "nouveau_abi16.h"
 #include "nouveau_sched.h"
+#include "nouveau_chan.h"
 
 #define NOUVEAU_SCHED_JOB_TIMEOUT_MS		10000
 
@@ -393,10 +394,23 @@ nouveau_sched_free_job(struct drm_sched_job *sched_job)
 	nouveau_job_fini(job);
 }
 
+static void
+nouveau_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+	struct nouveau_fence *fence;
+	struct nouveau_job *job;
+
+	job = to_nouveau_job(sched_job);
+	fence = to_nouveau_fence(job->done_fence);
+
+	nouveau_fence_cancel(fence);
+}
+
 static const struct drm_sched_backend_ops nouveau_sched_ops = {
 	.run_job = nouveau_sched_run_job,
 	.timedout_job = nouveau_sched_timedout_job,
 	.free_job = nouveau_sched_free_job,
+	.cancel_job = nouveau_sched_cancel_job,
 };
 
 static int
@@ -482,7 +496,6 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
 	return 0;
 }
 
-
 static void
 nouveau_sched_fini(struct nouveau_sched *sched)
 {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
  2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
                   ` (5 preceding siblings ...)
  2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
  2025-07-07 13:47   ` Danilo Krummrich
  6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media

struct nouveau_sched contains a waitque needed to prevent
drm_sched_fini() from being called while there are still jobs pending.
Doing so so far would have caused memory leaks.

With the new memleak-free mode of operation switched on in
drm_sched_fini() by providing the callback
nouveau_sched_fence_context_kill() the waitque is not necessary anymore.

Remove the waitque.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++++++-------------
 drivers/gpu/drm/nouveau/nouveau_sched.h |  9 +++------
 drivers/gpu/drm/nouveau/nouveau_uvmm.c  |  8 ++++----
 3 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 2ec62059c351..7d9c3418e76b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -122,11 +122,9 @@ nouveau_job_done(struct nouveau_job *job)
 {
 	struct nouveau_sched *sched = job->sched;
 
-	spin_lock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
 	list_del(&job->entry);
-	spin_unlock(&sched->job.list.lock);
-
-	wake_up(&sched->job.wq);
+	spin_unlock(&sched->job_list.lock);
 }
 
 void
@@ -307,9 +305,9 @@ nouveau_job_submit(struct nouveau_job *job)
 	}
 
 	/* Submit was successful; add the job to the schedulers job list. */
-	spin_lock(&sched->job.list.lock);
-	list_add(&job->entry, &sched->job.list.head);
-	spin_unlock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
+	list_add(&job->entry, &sched->job_list.head);
+	spin_unlock(&sched->job_list.lock);
 
 	drm_sched_job_arm(&job->base);
 	job->done_fence = dma_fence_get(&job->base.s_fence->finished);
@@ -460,9 +458,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
 		goto fail_sched;
 
 	mutex_init(&sched->mutex);
-	spin_lock_init(&sched->job.list.lock);
-	INIT_LIST_HEAD(&sched->job.list.head);
-	init_waitqueue_head(&sched->job.wq);
+	spin_lock_init(&sched->job_list.lock);
+	INIT_LIST_HEAD(&sched->job_list.head);
 
 	return 0;
 
@@ -502,9 +499,6 @@ nouveau_sched_fini(struct nouveau_sched *sched)
 	struct drm_gpu_scheduler *drm_sched = &sched->base;
 	struct drm_sched_entity *entity = &sched->entity;
 
-	rmb(); /* for list_empty to work without lock */
-	wait_event(sched->job.wq, list_empty(&sched->job.list.head));
-
 	drm_sched_entity_fini(entity);
 	drm_sched_fini(drm_sched);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index 20cd1da8db73..b98c3f0bef30 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -103,12 +103,9 @@ struct nouveau_sched {
 	struct mutex mutex;
 
 	struct {
-		struct {
-			struct list_head head;
-			spinlock_t lock;
-		} list;
-		struct wait_queue_head wq;
-	} job;
+		struct list_head head;
+		spinlock_t lock;
+	} job_list;
 };
 
 int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index 48f105239f42..ddfc46bc1b3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1019,8 +1019,8 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 	u64 end = addr + range;
 
 again:
-	spin_lock(&sched->job.list.lock);
-	list_for_each_entry(__job, &sched->job.list.head, entry) {
+	spin_lock(&sched->job_list.lock);
+	list_for_each_entry(__job, &sched->job_list.head, entry) {
 		struct nouveau_uvmm_bind_job *bind_job = to_uvmm_bind_job(__job);
 
 		list_for_each_op(op, &bind_job->ops) {
@@ -1030,7 +1030,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 
 				if (!(end <= op_addr || addr >= op_end)) {
 					nouveau_uvmm_bind_job_get(bind_job);
-					spin_unlock(&sched->job.list.lock);
+					spin_unlock(&sched->job_list.lock);
 					wait_for_completion(&bind_job->complete);
 					nouveau_uvmm_bind_job_put(bind_job);
 					goto again;
@@ -1038,7 +1038,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 			}
 		}
 	}
-	spin_unlock(&sched->job.list.lock);
+	spin_unlock(&sched->job_list.lock);
 }
 
 static int
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide
  2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
@ 2025-07-07 13:45   ` Danilo Krummrich
  0 siblings, 0 replies; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 13:45 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
	Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
	linux-media

On 7/7/25 3:42 PM, Philipp Stanner wrote:
> In order to implement a new DRM GPU scheduler callback in Nouveau, a
> helper for obtaining a nouveau_fence from a dma_fence is necessary. Such
> a helper exists already inside nouveau_fence.c, called from_fence().
> 
> Make that helper available to other C files with a more precise name.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>

Acked-by: Danilo Krummrich <dakr@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
  2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
@ 2025-07-07 13:47   ` Danilo Krummrich
  2025-07-07 13:54     ` Philipp Stanner
  0 siblings, 1 reply; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 13:47 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
	Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
	linux-media

On 7/7/25 3:42 PM, Philipp Stanner wrote:
> struct nouveau_sched contains a waitque needed to prevent
> drm_sched_fini() from being called while there are still jobs pending.
> Doing so so far would have caused memory leaks.
> 
> With the new memleak-free mode of operation switched on in
> drm_sched_fini() by providing the callback
> nouveau_sched_fence_context_kill() the waitque is not necessary anymore.
> 
> Remove the waitque.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>

Doesn't this break the driver until fixed up by the subsequent patch?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
  2025-07-07 13:47   ` Danilo Krummrich
@ 2025-07-07 13:54     ` Philipp Stanner
  2025-07-07 14:26       ` Danilo Krummrich
  0 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:54 UTC (permalink / raw)
  To: Danilo Krummrich, Philipp Stanner
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
	Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
	linux-media

On Mon, 2025-07-07 at 15:47 +0200, Danilo Krummrich wrote:
> On 7/7/25 3:42 PM, Philipp Stanner wrote:
> > struct nouveau_sched contains a waitque needed to prevent
> > drm_sched_fini() from being called while there are still jobs
> > pending.
> > Doing so so far would have caused memory leaks.
> > 
> > With the new memleak-free mode of operation switched on in
> > drm_sched_fini() by providing the callback
> > nouveau_sched_fence_context_kill() the waitque is not necessary
> > anymore.
> > 
> > Remove the waitque.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> 
> Doesn't this break the driver until fixed up by the subsequent patch?
> 

Did you mean to answer to patch 6?

Patch 6 implements the cancel_job() callback for nouveau, which makes
sure the (still existing) waitque will never block. The, now redundant,
waitque then gets removed in patch 7.


P.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
  2025-07-07 13:54     ` Philipp Stanner
@ 2025-07-07 14:26       ` Danilo Krummrich
  0 siblings, 0 replies; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 14:26 UTC (permalink / raw)
  To: phasta
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
	Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
	linux-media

On 7/7/25 3:54 PM, Philipp Stanner wrote:
> On Mon, 2025-07-07 at 15:47 +0200, Danilo Krummrich wrote:
>> On 7/7/25 3:42 PM, Philipp Stanner wrote:
>>> struct nouveau_sched contains a waitque needed to prevent
>>> drm_sched_fini() from being called while there are still jobs
>>> pending.
>>> Doing so so far would have caused memory leaks.
>>>
>>> With the new memleak-free mode of operation switched on in
>>> drm_sched_fini() by providing the callback
>>> nouveau_sched_fence_context_kill() the waitque is not necessary
>>> anymore.
>>>
>>> Remove the waitque.
>>>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>
>> Doesn't this break the driver until fixed up by the subsequent patch?
>>
> 
> Did you mean to answer to patch 6?
> 
> Patch 6 implements the cancel_job() callback for nouveau, which makes
> sure the (still existing) waitque will never block. The, now redundant,
> waitque then gets removed in patch 7.

Yup, I mixed up the order on my end.

Acked-by: Danilo Krummrich <dakr@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown
  2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
@ 2025-07-07 14:26   ` Danilo Krummrich
  0 siblings, 0 replies; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 14:26 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
	Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
	linux-media



On 7/7/25 3:42 PM, Philipp Stanner wrote:
> There is a new callback for always tearing the scheduler down in a
> leak-free, deadlock-free manner.
> 
> Port Nouveau as its first user by providing the scheduler with a
> callback that ensures the fence context gets killed in drm_sched_fini().
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>

Acked-by: Danilo Krummrich <dakr@kernel.org>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job()
  2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
@ 2025-07-07 15:20   ` Tvrtko Ursulin
  0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2025-07-07 15:20 UTC (permalink / raw)
  To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media


On 07/07/2025 14:42, Philipp Stanner wrote:
> The scheduler unit tests now provide a new callback, cancel_job(). This
> callback gets used by drm_sched_fini() for all still pending jobs to
> cancel them.
> 
> Implement a new unit test to test this.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   drivers/gpu/drm/scheduler/tests/tests_basic.c | 43 +++++++++++++++++++
>   1 file changed, 43 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> index 7230057e0594..fa3da2db4893 100644
> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> @@ -204,6 +204,48 @@ static struct kunit_suite drm_sched_basic = {
>   	.test_cases = drm_sched_basic_tests,
>   };
>   
> +static void drm_sched_basic_cancel(struct kunit *test)
> +{
> +	struct drm_mock_sched_entity *entity;
> +	struct drm_mock_scheduler *sched;
> +	struct drm_mock_sched_job *job;
> +	bool done;
> +
> +	/*
> +	 * Check that the configured credit limit is respected.
> +	 */

Copy & paste mishap.

> +
> +	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> +	sched->base.credit_limit = 1;

Ditto.

> +
> +	entity = drm_mock_sched_entity_new(test, DRM_SCHED_PRIORITY_NORMAL,
> +					   sched);
> +
> +	job = drm_mock_sched_job_new(test, entity);
> +
> +	drm_mock_sched_job_submit(job);
> +
> +	done = drm_mock_sched_job_wait_scheduled(job, HZ);
> +	KUNIT_ASSERT_TRUE(test, done);
> +
> +	drm_mock_sched_entity_free(entity);
> +	drm_mock_sched_fini(sched);
> +
> +	KUNIT_ASSERT_EQ(test, job->hw_fence.error, -ECANCELED);
> +}
> +
> +static struct kunit_case drm_sched_cancel_tests[] = {
> +	KUNIT_CASE(drm_sched_basic_cancel),
> +	{}
> +};
> +
> +static struct kunit_suite drm_sched_cancel = {
> +	.name = "drm_sched_basic_cancel_tests",
> +	.init = drm_sched_basic_init,
> +	.exit = drm_sched_basic_exit,
> +	.test_cases = drm_sched_cancel_tests,
> +};
> +
>   static void drm_sched_basic_timeout(struct kunit *test)
>   {
>   	struct drm_mock_scheduler *sched = test->priv;
> @@ -471,6 +513,7 @@ static struct kunit_suite drm_sched_credits = {
>   
>   kunit_test_suites(&drm_sched_basic,
>   		  &drm_sched_timeout,
> +		  &drm_sched_cancel,
>   		  &drm_sched_priority,
>   		  &drm_sched_modify_sched,
>   		  &drm_sched_credits);

The rest looks good. With the comment fixed and credit limit setting 
removed:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/7] drm/sched/tests: Implement cancel_job() callback
  2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
@ 2025-07-07 15:25   ` Tvrtko Ursulin
  0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2025-07-07 15:25 UTC (permalink / raw)
  To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
	Pierre-Eric Pelloux-Prayer
  Cc: dri-devel, nouveau, linux-kernel, linux-media


On 07/07/2025 14:42, Philipp Stanner wrote:
> The GPU Scheduler now supports a new callback, cancel_job(), which lets
> the scheduler cancel all jobs which might not yet be freed when
> drm_sched_fini() runs. Using this callback allows for significantly
> simplifying the mock scheduler teardown code.
> 
> Implement the cancel_job() callback and adjust the code where necessary.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 66 +++++++------------
>   1 file changed, 23 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index 49d067fecd67..2d3169d95200 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
>   	lockdep_assert_held(&sched->lock);
>   
>   	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> -	list_move_tail(&job->link, &sched->done_list);
> +	list_del(&job->link);
>   	dma_fence_signal_locked(&job->hw_fence);
>   	complete(&job->done);
>   }
> @@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
>   
>   static void mock_sched_free_job(struct drm_sched_job *sched_job)
>   {
> -	struct drm_mock_scheduler *sched =
> -			drm_sched_to_mock_sched(sched_job->sched);
>   	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> -	unsigned long flags;
>   
> -	/* Remove from the scheduler done list. */
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_del(&job->link);
> -	spin_unlock_irqrestore(&sched->lock, flags);
>   	dma_fence_put(&job->hw_fence);
> -
>   	drm_sched_job_cleanup(sched_job);
>   
>   	/* Mock job itself is freed by the kunit framework. */
>   }
>   
> +static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
> +{
> +	struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
> +	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> +	unsigned long flags;
> +
> +	hrtimer_cancel(&job->timer);
> +
> +	spin_lock_irqsave(&sched->lock, flags);
> +	if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
> +		list_del(&job->link);
> +		dma_fence_set_error(&job->hw_fence, -ECANCELED);
> +		dma_fence_signal_locked(&job->hw_fence);
> +	}
> +	spin_unlock_irqrestore(&sched->lock, flags);
> +
> +	/* The GPU Scheduler will call drm_sched_backend_ops.free_job(), still.
> +	 * Mock job itself is freed by the kunit framework. */

I suggest sticking with the multi-line comment format as used in this 
file. Ie.

/*
  * Multi-line comment.
  */

> +}
> +
>   static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
>   	.run_job = mock_sched_run_job,
>   	.timedout_job = mock_sched_timedout_job,
> -	.free_job = mock_sched_free_job
> +	.free_job = mock_sched_free_job,
> +	.cancel_job = mock_sched_cancel_job,
>   };
>   
>   /**
> @@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
>   	sched->hw_timeline.context = dma_fence_context_alloc(1);
>   	atomic_set(&sched->hw_timeline.next_seqno, 0);
>   	INIT_LIST_HEAD(&sched->job_list);
> -	INIT_LIST_HEAD(&sched->done_list);

You forgot to remove done_list from the struct definition.

Regards,

Tvrtko

>   	spin_lock_init(&sched->lock);
>   
>   	return sched;
> @@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
>    */
>   void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>   {
> -	struct drm_mock_sched_job *job, *next;
> -	unsigned long flags;
> -	LIST_HEAD(list);
> -
> -	drm_sched_wqueue_stop(&sched->base);
> -
> -	/* Force complete all unfinished jobs. */
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_for_each_entry_safe(job, next, &sched->job_list, link)
> -		list_move_tail(&job->link, &list);
> -	spin_unlock_irqrestore(&sched->lock, flags);
> -
> -	list_for_each_entry(job, &list, link)
> -		hrtimer_cancel(&job->timer);
> -
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_for_each_entry_safe(job, next, &list, link)
> -		drm_mock_sched_job_complete(job);
> -	spin_unlock_irqrestore(&sched->lock, flags);
> -
> -	/*
> -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
> -	 * free worker.
> -	 */
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_for_each_entry_safe(job, next, &sched->done_list, link)
> -		list_move_tail(&job->link, &list);
> -	spin_unlock_irqrestore(&sched->lock, flags);
> -
> -	list_for_each_entry_safe(job, next, &list, link)
> -		mock_sched_free_job(&job->base);
> -
>   	drm_sched_fini(&sched->base);
>   }
>   


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-07-07 15:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
2025-07-07 15:25   ` Tvrtko Ursulin
2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
2025-07-07 15:20   ` Tvrtko Ursulin
2025-07-07 13:42 ` [PATCH v2 4/7] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
2025-07-07 13:45   ` Danilo Krummrich
2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-07-07 14:26   ` Danilo Krummrich
2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
2025-07-07 13:47   ` Danilo Krummrich
2025-07-07 13:54     ` Philipp Stanner
2025-07-07 14:26       ` Danilo Krummrich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).