linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/5] Fix memory leaks in drm_sched_fini()
@ 2025-05-22  8:27 Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

Changelog below.

I experimented with the alternative solution (a cancel_job() callback) and
maintain the position that this approach is more stable and robust,
despite the additional code. I feel more comfortable with it regarding
stability and the possibility of porting more drivers.

Changes in v3:
  - Fix and simplify cleanup callback in unit tests. Behave like a
    driver would: cancel interrupts (hrtimer), then simply run into
    drm_sched_fini().
  - Use drm_mock_sched_complete() as a centralized position to signal
    fences.
  - Reorder pending_list-is-empty warning patch for correct behavior. (Tvrtko)
  - Rename cleanup callback so that it becomes clear that it's about
    signaling all in-flight fences. (Tvrtko, Danilo, Me)
  - Add more documentation for the new functions.
  - Fix some typos.

Changes in v2:
  - Port kunit tests to new cleanup approach (Tvrtko), however so far
    still with the fence context-kill approach (Me; to be discussed.)
  - Improve len(pending_list) > 0 warning print. (Me)
  - Remove forgotten RFC comments. (Me)
  - Remove timeout boolean, since it's surplus. (Me)
  - Fix misunderstandings in the cover letter. (Tvrtko)

Changes since the RFC:
  - (None)

Howdy,

as many of you know, we have potential memory leaks in drm_sched_fini(),
which various parties have tried to solve with various methods in the
past.

In our past discussions, we came to the conclusion that the simplest
solution, blocking in drm_sched_fini(), is not possible because it could
cause processes to ignore SIGKILL and block for too long (which could
turn out to be an effective way to generate a funny email from Linus,
though :) )

Another idea was to have submitted jobs refcount the scheduler. I
investigated this, and we found that it would *additionally* require
*the scheduler* to refcount everything *in the driver* that is accessed
through the still-running callbacks, since the driver might want to
unload after a non-blocking drm_sched_fini() call. So that's not a
solution either.

This series is a new approach, somewhat based on the original
waitqueue idea. It looks as follows:

1. As a first step, have drm_sched_fini() block on a waitqueue until the
   pending_list becomes empty.
2. Provide the scheduler with a callback with which it can instruct the
   driver to kill the associated fence context. This will cause all
   pending hardware fences to get signalled. (Credit to Danilo, whose
   idea this was)
3. In drm_sched_fini(), first switch off submission of new jobs and
   timeouts (the latter might not be strictly necessary, but is probably
   cleaner).
4. Then, call the aforementioned callback, ensuring that free_job() will
   be called for all remaining jobs relatively quickly. This has the
   great advantage that the jobs get cleaned up through the standard
   mechanism.
5. Once all jobs are gone, also switch off the free_job() work item and
   then proceed as usual (see the sketch below).
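
In code, steps 3 to 5 boil down to the following branch in
drm_sched_fini() (abridged sketch; the full version is in patch 1):

void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
        if (sched->ops->cancel_pending_fences) {
                /* Step 3: stop the run_job work and the timeout handler. */
                drm_sched_submission_and_timeout_stop(sched);
                /*
                 * Step 4: have the driver signal all pending hardware
                 * fences, then wait until free_job() has run for every
                 * job remaining on the pending_list.
                 */
                drm_sched_cancel_jobs_and_wait(sched);
                /* Step 5: only now switch off the free_job work item. */
                drm_sched_free_stop(sched);
        } else {
                /* No callback provided: legacy mode, may leak jobs. */
                drm_sched_wqueue_stop(sched);
        }

        /* ... rest of the teardown as before ... */
}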

Furthermore, since there is now such a callback, we can provide an
if-branch checking for its existence. If the driver doesn't provide it,
drm_sched_fini() operates in "legacy mode". So none of the existing
drivers should notice a difference and we remain fully backwards
compatible.

Our glorious beta-tester is Nouveau, which so far had its own waitqueue
solution; that solution is now obsolete. The last two patches port
Nouveau and remove that waitqueue.
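
The driver-side part is small. For Nouveau, the new callback essentially
boils down to this (see patch 4):

static void
nouveau_sched_fence_context_kill(struct drm_gpu_scheduler *sched)
{
        struct nouveau_sched *nsched =
                container_of(sched, struct nouveau_sched, base);

        /* Killing the channel signals all of its unsignaled fences. */
        if (nsched->chan)
                nouveau_channel_kill(nsched->chan);
}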

I've tested this on a desktop environment with Nouveau. Works fine and
solves the problem (though we did discover an unrelated problem inside
Nouveau in the process).

Tvrtko's unit tests also run as expected (except for the new warning
print in patch 3), which is not surprising since they don't provide the
callback.

I'm looking forward to your input and feedback. I really hope we can
work this RFC into something that can provide users with a more
reliable, clean scheduler API.

Philipp

Philipp Stanner (5):
  drm/sched: Fix teardown leaks with waitqueue
  drm/sched/tests: Port tests to new cleanup method
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown

 drivers/gpu/drm/nouveau/nouveau_abi16.c       |   4 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c         |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       |  39 ++++---
 drivers/gpu/drm/nouveau/nouveau_sched.h       |  12 +-
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        |   8 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 108 +++++++++++++++---
 .../gpu/drm/scheduler/tests/mock_scheduler.c  |  67 ++++-------
 drivers/gpu/drm/scheduler/tests/sched_tests.h |   4 +-
 include/drm/gpu_scheduler.h                   |  19 +++
 9 files changed, 170 insertions(+), 93 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-05-22  8:27 [PATCH v3 0/5] Fix memory leaks in drm_sched_fini() Philipp Stanner
@ 2025-05-22  8:27 ` Philipp Stanner
  2025-05-22 12:44   ` Danilo Krummrich
  2025-05-22 13:37   ` Tvrtko Ursulin
  2025-05-22  8:27 ` [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method Philipp Stanner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel, Philipp Stanner

From: Philipp Stanner <pstanner@redhat.com>

The GPU scheduler currently does not ensure that its pending_list is
empty before performing various other teardown tasks in
drm_sched_fini().

If there are still jobs in the pending_list, this is problematic because
after scheduler teardown, no one will call backend_ops.free_job()
anymore. This would, consequently, result in memory leaks.

One way to solve this is to implement a waitqueue that drm_sched_fini()
blocks on until the pending_list has become empty. That waitqueue must
obviously not block for a significant time. Thus, it's necessary to only
wait if it's guaranteed that all fences will get signaled quickly.

This can be ensured by having the driver implement a new backend ops,
cancel_pending_fences(), in which the driver shall signal all
unsignaled, in-flight fences with an error.

Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
pending_list becomes empty. Wait in drm_sched_fini() for that to happen
if cancel_pending_fences() is implemented.

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 105 ++++++++++++++++++++-----
 include/drm/gpu_scheduler.h            |  19 +++++
 2 files changed, 105 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f7118497e47a..406572f5168e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -367,7 +367,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
  */
 static void __drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
 {
-	if (!READ_ONCE(sched->pause_submit))
+	if (!READ_ONCE(sched->pause_free))
 		queue_work(sched->submit_wq, &sched->work_free_job);
 }
 
@@ -1121,6 +1121,12 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
+		/*
+		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
+		 */
+		if (list_empty(&sched->pending_list))
+			wake_up(&sched->pending_list_waitque);
+
 		/* cancel this job's TO timer */
 		cancel_delayed_work(&sched->work_tdr);
 		/* make the scheduled timestamp more accurate */
@@ -1326,6 +1332,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
+	init_waitqueue_head(&sched->pending_list_waitque);
 	atomic_set(&sched->credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
@@ -1333,6 +1340,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
 	sched->pause_submit = false;
+	sched->pause_free = false;
 
 	sched->ready = true;
 	return 0;
@@ -1350,33 +1358,90 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 }
 EXPORT_SYMBOL(drm_sched_init);
 
+/**
+ * drm_sched_submission_and_timeout_stop - stop everything except for free_job
+ * @sched: scheduler instance
+ *
+ * Helper for tearing down the scheduler in drm_sched_fini().
+ */
+static void
+drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler *sched)
+{
+	WRITE_ONCE(sched->pause_submit, true);
+	cancel_work_sync(&sched->work_run_job);
+	cancel_delayed_work_sync(&sched->work_tdr);
+}
+
+/**
+ * drm_sched_free_stop - stop free_job
+ * @sched: scheduler instance
+ *
+ * Helper for tearing down the scheduler in drm_sched_fini().
+ */
+static void drm_sched_free_stop(struct drm_gpu_scheduler *sched)
+{
+	WRITE_ONCE(sched->pause_free, true);
+	cancel_work_sync(&sched->work_free_job);
+}
+
+/**
+ * drm_sched_no_jobs_pending - check whether jobs are pending
+ * @sched: scheduler instance
+ *
+ * Checks if jobs are pending for @sched.
+ *
+ * Return: true if jobs are pending, false otherwise.
+ */
+static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
+{
+	bool empty;
+
+	spin_lock(&sched->job_list_lock);
+	empty = list_empty(&sched->pending_list);
+	spin_unlock(&sched->job_list_lock);
+
+	return empty;
+}
+
+/**
+ * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
+ * @sched: scheduler instance
+ *
+ * Must only be called if &struct drm_sched_backend_ops.cancel_pending_fences is
+ * implemented.
+ *
+ * Instructs the driver to kill the fence context associated with this scheduler,
+ * thereby signaling all pending fences. This, in turn, will trigger
+ * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
+ * The function then blocks until all pending jobs have been freed.
+ */
+static void drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
+{
+	sched->ops->cancel_pending_fences(sched);
+	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
  * @sched: scheduler instance
  *
- * Tears down and cleans up the scheduler.
- *
- * This stops submission of new jobs to the hardware through
- * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
- * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
- * There is no solution for this currently. Thus, it is up to the driver to make
- * sure that:
- *
- *  a) drm_sched_fini() is only called after for all submitted jobs
- *     drm_sched_backend_ops.free_job() has been called or that
- *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
- *     after drm_sched_fini() ran are freed manually.
- *
- * FIXME: Take care of the above problem and prevent this function from leaking
- * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
+ * Tears down and cleans up the scheduler. Might leak memory if
+ * &struct drm_sched_backend_ops.cancel_pending_fences is not implemented.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	drm_sched_wqueue_stop(sched);
+	if (sched->ops->cancel_pending_fences) {
+		drm_sched_submission_and_timeout_stop(sched);
+		drm_sched_cancel_jobs_and_wait(sched);
+		drm_sched_free_stop(sched);
+	} else {
+		/* We're in "legacy free-mode" and ignore potential mem leaks */
+		drm_sched_wqueue_stop(sched);
+	}
 
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
@@ -1464,7 +1529,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_ready);
 
 /**
- * drm_sched_wqueue_stop - stop scheduler submission
+ * drm_sched_wqueue_stop - stop scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Stops the scheduler from pulling new jobs from entities. It also stops
@@ -1473,13 +1538,14 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
 void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
 {
 	WRITE_ONCE(sched->pause_submit, true);
+	WRITE_ONCE(sched->pause_free, true);
 	cancel_work_sync(&sched->work_run_job);
 	cancel_work_sync(&sched->work_free_job);
 }
 EXPORT_SYMBOL(drm_sched_wqueue_stop);
 
 /**
- * drm_sched_wqueue_start - start scheduler submission
+ * drm_sched_wqueue_start - start scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
@@ -1490,6 +1556,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
 {
 	WRITE_ONCE(sched->pause_submit, false);
+	WRITE_ONCE(sched->pause_free, false);
 	queue_work(sched->submit_wq, &sched->work_run_job);
 	queue_work(sched->submit_wq, &sched->work_free_job);
 }
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index d860db087ea5..d8bd5b605336 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/xarray.h>
 #include <linux/workqueue.h>
+#include <linux/wait.h>
 
 #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
 
@@ -508,6 +509,19 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @cancel_pending_fences: cancel all unsignaled hardware fences
+	 *
+	 * This callback must signal all unsignaled hardware fences associated
+	 * with @sched with an appropriate error code (e.g., -ECANCELED). This
+	 * ensures that all jobs will get freed by the scheduler before
+	 * teardown.
+	 *
+	 * This callback is optional, but it is highly recommended to implement
+	 * it to avoid memory leaks.
+	 */
+	void (*cancel_pending_fences)(struct drm_gpu_scheduler *sched);
 };
 
 /**
@@ -533,6 +547,8 @@ struct drm_sched_backend_ops {
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
+ * @pending_list_waitque: a waitqueue for drm_sched_fini() to block on until all
+ *		          pending jobs have been finished.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
  *              guilty and it will no longer be considered for scheduling.
  * @score: score to help loadbalancer pick a idle sched
@@ -540,6 +556,7 @@ struct drm_sched_backend_ops {
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
  * @pause_submit: pause queuing of @work_run_job on @submit_wq
+ * @pause_free: pause queueing of @work_free_job on @submit_wq
  * @own_submit_wq: scheduler owns allocation of @submit_wq
  * @dev: system &struct device
  *
@@ -562,12 +579,14 @@ struct drm_gpu_scheduler {
 	struct delayed_work		work_tdr;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
+	wait_queue_head_t		pending_list_waitque;
 	int				hang_limit;
 	atomic_t                        *score;
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
 	bool				pause_submit;
+	bool				pause_free;
 	bool				own_submit_wq;
 	struct device			*dev;
 };
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
  2025-05-22  8:27 [PATCH v3 0/5] Fix memory leaks in drm_sched_fini() Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
@ 2025-05-22  8:27 ` Philipp Stanner
  2025-05-22 14:06   ` Tvrtko Ursulin
  2025-05-22  8:27 ` [PATCH v3 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

The drm_gpu_scheduler now supports a callback to help drm_sched_fini()
avoid memory leaks. This callback instructs the driver to signal all
pending hardware fences.

Implement the new callback
drm_sched_backend_ops.cancel_pending_fences().

Have the callback use drm_mock_sched_job_complete() with a new error
field for the fence error.

Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there is
no party for which checking for a CANCELED status would be useful
currently.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++------------
 drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
 2 files changed, 25 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index f999c8859cf7..eca47f0395bc 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct drm_mock_sched_entity *entity)
 	drm_sched_entity_destroy(&entity->base);
 }
 
-static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
+static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job, int err)
 {
 	struct drm_mock_scheduler *sched =
 		drm_sched_to_mock_sched(job->base.sched);
@@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
 	lockdep_assert_held(&sched->lock);
 
 	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
-	list_move_tail(&job->link, &sched->done_list);
+	list_del(&job->link);
+	if (err)
+		dma_fence_set_error(&job->hw_fence, err);
 	dma_fence_signal(&job->hw_fence);
 	complete(&job->done);
 }
@@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer *hrtimer)
 			break;
 
 		sched->hw_timeline.cur_seqno = job->hw_fence.seqno;
-		drm_mock_sched_job_complete(job);
+		drm_mock_sched_job_complete(job, 0);
 	}
 	spin_unlock_irqrestore(&sched->lock, flags);
 
@@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
 
 static void mock_sched_free_job(struct drm_sched_job *sched_job)
 {
-	struct drm_mock_scheduler *sched =
-			drm_sched_to_mock_sched(sched_job->sched);
 	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
-	unsigned long flags;
 
-	/* Remove from the scheduler done list. */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_del(&job->link);
-	spin_unlock_irqrestore(&sched->lock, flags);
 	dma_fence_put(&job->hw_fence);
-
 	drm_sched_job_cleanup(sched_job);
 
 	/* Mock job itself is freed by the kunit framework. */
 }
 
+static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *gsched)
+{
+	struct drm_mock_sched_job *job, *next;
+	struct drm_mock_scheduler *sched;
+	unsigned long flags;
+
+	sched = container_of(gsched, struct drm_mock_scheduler, base);
+
+	spin_lock_irqsave(&sched->lock, flags);
+	list_for_each_entry_safe(job, next, &sched->job_list, link)
+		drm_mock_sched_job_complete(job, -ECANCELED);
+	spin_unlock_irqrestore(&sched->lock, flags);
+}
+
 static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
 	.run_job = mock_sched_run_job,
 	.timedout_job = mock_sched_timedout_job,
-	.free_job = mock_sched_free_job
+	.free_job = mock_sched_free_job,
+	.cancel_pending_fences = mock_sched_cancel_pending_fences,
 };
 
 /**
@@ -265,7 +274,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
 	sched->hw_timeline.context = dma_fence_context_alloc(1);
 	atomic_set(&sched->hw_timeline.next_seqno, 0);
 	INIT_LIST_HEAD(&sched->job_list);
-	INIT_LIST_HEAD(&sched->done_list);
 	spin_lock_init(&sched->lock);
 
 	return sched;
@@ -280,38 +288,11 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
  */
 void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
 {
-	struct drm_mock_sched_job *job, *next;
-	unsigned long flags;
-	LIST_HEAD(list);
+	struct drm_mock_sched_job *job;
 
-	drm_sched_wqueue_stop(&sched->base);
-
-	/* Force complete all unfinished jobs. */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &sched->job_list, link)
-		list_move_tail(&job->link, &list);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	list_for_each_entry(job, &list, link)
+	list_for_each_entry(job, &sched->job_list, link)
 		hrtimer_cancel(&job->timer);
 
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &list, link)
-		drm_mock_sched_job_complete(job);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	/*
-	 * Free completed jobs and jobs not yet processed by the DRM scheduler
-	 * free worker.
-	 */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &sched->done_list, link)
-		list_move_tail(&job->link, &list);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	list_for_each_entry_safe(job, next, &list, link)
-		mock_sched_free_job(&job->base);
-
 	drm_sched_fini(&sched->base);
 }
 
@@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
 		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
 			break;
 
-		drm_mock_sched_job_complete(job);
+		drm_mock_sched_job_complete(job, 0);
 		found++;
 	}
 unlock:
diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
index 27caf8285fb7..22e530d87791 100644
--- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
+++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
@@ -32,9 +32,8 @@
  *
  * @base: DRM scheduler base class
  * @test: Backpointer to owning the kunit test case
- * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
+ * @lock: Lock to protect the simulated @hw_timeline and @job_list
  * @job_list: List of jobs submitted to the mock GPU
- * @done_list: List of jobs completed by the mock GPU
  * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
  *		 @cur_seqno for implementing a struct dma_fence signaling the
  *		 simulated job completion.
@@ -49,7 +48,6 @@ struct drm_mock_scheduler {
 
 	spinlock_t		lock;
 	struct list_head	job_list;
-	struct list_head	done_list;
 
 	struct {
 		u64		context;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 3/5] drm/sched: Warn if pending list is not empty
  2025-05-22  8:27 [PATCH v3 0/5] Fix memory leaks in drm_sched_fini() Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method Philipp Stanner
@ 2025-05-22  8:27 ` Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
  4 siblings, 0 replies; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

drm_sched_fini() can leak jobs under certain circumstances.

Warn if that happens.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 406572f5168e..48f07e6cfe2b 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1469,6 +1469,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	sched->ready = false;
 	kfree(sched->sched_rq);
 	sched->sched_rq = NULL;
+
+	if (!list_empty(&sched->pending_list))
+		dev_err(sched->dev, "Tearing down scheduler while jobs are pending!\n");
 }
 EXPORT_SYMBOL(drm_sched_fini);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 4/5] drm/nouveau: Add new callback for scheduler teardown
  2025-05-22  8:27 [PATCH v3 0/5] Fix memory leaks in drm_sched_fini() Philipp Stanner
                   ` (2 preceding siblings ...)
  2025-05-22  8:27 ` [PATCH v3 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
@ 2025-05-22  8:27 ` Philipp Stanner
  2025-05-22  8:27 ` [PATCH v3 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
  4 siblings, 0 replies; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

There is a new callback for always tearing the scheduler down in a
leak-free, deadlock-free manner.

Port Nouveau as its first user by providing the scheduler with a
callback that ensures the fence context gets killed in drm_sched_fini().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_abi16.c |  4 ++--
 drivers/gpu/drm/nouveau/nouveau_drm.c   |  2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c | 19 +++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_sched.h |  3 +++
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_abi16.c b/drivers/gpu/drm/nouveau/nouveau_abi16.c
index 2a0617e5fe2a..53b8a85a8adb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_abi16.c
+++ b/drivers/gpu/drm/nouveau/nouveau_abi16.c
@@ -415,8 +415,8 @@ nouveau_abi16_ioctl_channel_alloc(ABI16_IOCTL_ARGS)
 	 * The client lock is already acquired by nouveau_abi16_get().
 	 */
 	if (nouveau_cli_uvmm(cli)) {
-		ret = nouveau_sched_create(&chan->sched, drm, drm->sched_wq,
-					   chan->chan->dma.ib_max);
+		ret = nouveau_sched_create(&chan->sched, drm, chan->chan,
+					   drm->sched_wq, chan->chan->dma.ib_max);
 		if (ret)
 			goto done;
 	}
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index c69139701056..2a2b319dca5f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -313,7 +313,7 @@ nouveau_cli_init(struct nouveau_drm *drm, const char *sname,
 	 * locks which indirectly or directly are held for allocations
 	 * elsewhere.
 	 */
-	ret = nouveau_sched_create(&cli->sched, drm, NULL, 1);
+	ret = nouveau_sched_create(&cli->sched, drm, NULL, NULL, 1);
 	if (ret)
 		goto done;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index d326e55d2d24..de14883ee4c8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -11,6 +11,7 @@
 #include "nouveau_exec.h"
 #include "nouveau_abi16.h"
 #include "nouveau_sched.h"
+#include "nouveau_chan.h"
 
 #define NOUVEAU_SCHED_JOB_TIMEOUT_MS		10000
 
@@ -392,10 +393,22 @@ nouveau_sched_free_job(struct drm_sched_job *sched_job)
 	nouveau_job_fini(job);
 }
 
+static void
+nouveau_sched_fence_context_kill(struct drm_gpu_scheduler *sched)
+{
+	struct nouveau_sched *nsched;
+
+	nsched = container_of(sched, struct nouveau_sched, base);
+
+	if (nsched->chan)
+		nouveau_channel_kill(nsched->chan);
+}
+
 static const struct drm_sched_backend_ops nouveau_sched_ops = {
 	.run_job = nouveau_sched_run_job,
 	.timedout_job = nouveau_sched_timedout_job,
 	.free_job = nouveau_sched_free_job,
+	.cancel_pending_fences = nouveau_sched_fence_context_kill,
 };
 
 static int
@@ -461,7 +474,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
 
 int
 nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
-		     struct workqueue_struct *wq, u32 credit_limit)
+		     struct nouveau_channel *chan, struct workqueue_struct *wq,
+		     u32 credit_limit)
 {
 	struct nouveau_sched *sched;
 	int ret;
@@ -470,6 +484,8 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
 	if (!sched)
 		return -ENOMEM;
 
+	sched->chan = chan;
+
 	ret = nouveau_sched_init(sched, drm, wq, credit_limit);
 	if (ret) {
 		kfree(sched);
@@ -481,7 +497,6 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
 	return 0;
 }
 
-
 static void
 nouveau_sched_fini(struct nouveau_sched *sched)
 {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index 20cd1da8db73..e6e2016a3569 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -9,6 +9,7 @@
 #include <drm/gpu_scheduler.h>
 
 #include "nouveau_drv.h"
+#include "nouveau_chan.h"
 
 #define to_nouveau_job(sched_job)		\
 		container_of((sched_job), struct nouveau_job, base)
@@ -101,6 +102,7 @@ struct nouveau_sched {
 	struct drm_sched_entity entity;
 	struct workqueue_struct *wq;
 	struct mutex mutex;
+	struct nouveau_channel *chan;
 
 	struct {
 		struct {
@@ -112,6 +114,7 @@ struct nouveau_sched {
 };
 
 int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
+			 struct nouveau_channel *chan,
 			 struct workqueue_struct *wq, u32 credit_limit);
 void nouveau_sched_destroy(struct nouveau_sched **psched);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 5/5] drm/nouveau: Remove waitque for sched teardown
  2025-05-22  8:27 [PATCH v3 0/5] Fix memory leaks in drm_sched_fini() Philipp Stanner
                   ` (3 preceding siblings ...)
  2025-05-22  8:27 ` [PATCH v3 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
@ 2025-05-22  8:27 ` Philipp Stanner
  4 siblings, 0 replies; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22  8:27 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

struct nouveau_sched contains a waitqueue needed to prevent
drm_sched_fini() from being called while there are still jobs pending.
Until now, doing so would have caused memory leaks.

With the new memleak-free mode of operation, which is switched on in
drm_sched_fini() by providing the callback
nouveau_sched_fence_context_kill(), the waitqueue is not necessary
anymore.

Remove the waitqueue.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++++++-------------
 drivers/gpu/drm/nouveau/nouveau_sched.h |  9 +++------
 drivers/gpu/drm/nouveau/nouveau_uvmm.c  |  8 ++++----
 3 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index de14883ee4c8..1d300b382b32 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -121,11 +121,9 @@ nouveau_job_done(struct nouveau_job *job)
 {
 	struct nouveau_sched *sched = job->sched;
 
-	spin_lock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
 	list_del(&job->entry);
-	spin_unlock(&sched->job.list.lock);
-
-	wake_up(&sched->job.wq);
+	spin_unlock(&sched->job_list.lock);
 }
 
 void
@@ -306,9 +304,9 @@ nouveau_job_submit(struct nouveau_job *job)
 	}
 
 	/* Submit was successful; add the job to the schedulers job list. */
-	spin_lock(&sched->job.list.lock);
-	list_add(&job->entry, &sched->job.list.head);
-	spin_unlock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
+	list_add(&job->entry, &sched->job_list.head);
+	spin_unlock(&sched->job_list.lock);
 
 	drm_sched_job_arm(&job->base);
 	job->done_fence = dma_fence_get(&job->base.s_fence->finished);
@@ -458,9 +456,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
 		goto fail_sched;
 
 	mutex_init(&sched->mutex);
-	spin_lock_init(&sched->job.list.lock);
-	INIT_LIST_HEAD(&sched->job.list.head);
-	init_waitqueue_head(&sched->job.wq);
+	spin_lock_init(&sched->job_list.lock);
+	INIT_LIST_HEAD(&sched->job_list.head);
 
 	return 0;
 
@@ -503,9 +500,6 @@ nouveau_sched_fini(struct nouveau_sched *sched)
 	struct drm_gpu_scheduler *drm_sched = &sched->base;
 	struct drm_sched_entity *entity = &sched->entity;
 
-	rmb(); /* for list_empty to work without lock */
-	wait_event(sched->job.wq, list_empty(&sched->job.list.head));
-
 	drm_sched_entity_fini(entity);
 	drm_sched_fini(drm_sched);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index e6e2016a3569..339a14563fbb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -105,12 +105,9 @@ struct nouveau_sched {
 	struct nouveau_channel *chan;
 
 	struct {
-		struct {
-			struct list_head head;
-			spinlock_t lock;
-		} list;
-		struct wait_queue_head wq;
-	} job;
+		struct list_head head;
+		spinlock_t lock;
+	} job_list;
 };
 
 int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index 48f105239f42..ddfc46bc1b3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1019,8 +1019,8 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 	u64 end = addr + range;
 
 again:
-	spin_lock(&sched->job.list.lock);
-	list_for_each_entry(__job, &sched->job.list.head, entry) {
+	spin_lock(&sched->job_list.lock);
+	list_for_each_entry(__job, &sched->job_list.head, entry) {
 		struct nouveau_uvmm_bind_job *bind_job = to_uvmm_bind_job(__job);
 
 		list_for_each_op(op, &bind_job->ops) {
@@ -1030,7 +1030,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 
 				if (!(end <= op_addr || addr >= op_end)) {
 					nouveau_uvmm_bind_job_get(bind_job);
-					spin_unlock(&sched->job.list.lock);
+					spin_unlock(&sched->job_list.lock);
 					wait_for_completion(&bind_job->complete);
 					nouveau_uvmm_bind_job_put(bind_job);
 					goto again;
@@ -1038,7 +1038,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 			}
 		}
 	}
-	spin_unlock(&sched->job.list.lock);
+	spin_unlock(&sched->job_list.lock);
 }
 
 static int
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-05-22  8:27 ` [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
@ 2025-05-22 12:44   ` Danilo Krummrich
  2025-05-22 13:37   ` Tvrtko Ursulin
  1 sibling, 0 replies; 13+ messages in thread
From: Danilo Krummrich @ 2025-05-22 12:44 UTC (permalink / raw)
  To: Philipp Stanner
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Tvrtko Ursulin, dri-devel, nouveau,
	linux-kernel, Philipp Stanner

On Thu, May 22, 2025 at 10:27:39AM +0200, Philipp Stanner wrote:
> +/**
> + * drm_sched_submission_and_timeout_stop - stop everything except for free_job
> + * @sched: scheduler instance
> + *
> + * Helper for tearing down the scheduler in drm_sched_fini().
> + */
> +static void
> +drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler *sched)
> +{
> +	WRITE_ONCE(sched->pause_submit, true);
> +	cancel_work_sync(&sched->work_run_job);
> +	cancel_delayed_work_sync(&sched->work_tdr);
> +}
> +
> +/**
> + * drm_sched_free_stop - stop free_job
> + * @sched: scheduler instance
> + *
> + * Helper for tearing down the scheduler in drm_sched_fini().
> + */
> +static void drm_sched_free_stop(struct drm_gpu_scheduler *sched)
> +{
> +	WRITE_ONCE(sched->pause_free, true);
> +	cancel_work_sync(&sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_no_jobs_pending - check whether jobs are pending
> + * @sched: scheduler instance
> + *
> + * Checks if jobs are pending for @sched.
> + *
> + * Return: true if jobs are pending, false otherwise.
> + */
> +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> +{
> +	bool empty;
> +
> +	spin_lock(&sched->job_list_lock);
> +	empty = list_empty(&sched->pending_list);
> +	spin_unlock(&sched->job_list_lock);
> +
> +	return empty;
> +}

I understand that the way you use this function is correct, since you only call
it *after* drm_sched_submission_and_timeout_stop(), which means that no new
items can end up on the pending_list.

But if we look at this function without context, it's broken:

The documentation says "Return: true if jobs are pending, false otherwise.", but
you can't guarantee that, since a new job could be added to the pending_list
after spin_unlock().

Hence, providing this function is a footgun.

Instead, you should put this teardown sequence in a single function, where you
can control the external conditions, i.e. that
drm_sched_submission_and_timeout_stop() has been called.

Please also add a comment explaining why we can release the lock and still work
with the value returned by list_empty() in this case, i.e. because we guarantee
that the list item count converges to zero.

The other two helpers above, drm_sched_submission_and_timeout_stop() and
drm_sched_free_stop() should be fine to have.
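
I.e. something in this direction (untested, just to illustrate the idea):

static void drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
{
	sched->ops->cancel_pending_fences(sched);

	/*
	 * Submission and the timeout handler have been stopped by the
	 * caller, hence the pending_list can only shrink from here on.
	 * That's why it is fine to evaluate the list state without
	 * holding job_list_lock across the wait: once the list is
	 * observed empty, it stays empty.
	 */
	wait_event(sched->pending_list_waitque,
		   list_empty_careful(&sched->pending_list));
}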

> +/**
> + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
> + * @sched: scheduler instance
> + *
> + * Must only be called if &struct drm_sched_backend_ops.cancel_pending_fences is
> + * implemented.
> + *
> + * Instructs the driver to kill the fence context associated with this scheduler,
> + * thereby signaling all pending fences. This, in turn, will trigger
> + * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
> + * The function then blocks until all pending jobs have been freed.
> + */
> +static void drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
> +{
> +	sched->ops->cancel_pending_fences(sched);
> +	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
> +}

Same here, you can't have this as an isolated helper.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-05-22  8:27 ` [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
  2025-05-22 12:44   ` Danilo Krummrich
@ 2025-05-22 13:37   ` Tvrtko Ursulin
  2025-05-22 15:32     ` Philipp Stanner
  1 sibling, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2025-05-22 13:37 UTC (permalink / raw)
  To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel, Philipp Stanner


On 22/05/2025 09:27, Philipp Stanner wrote:
> From: Philipp Stanner <pstanner@redhat.com>
> 
> The GPU scheduler currently does not ensure that its pending_list is
> empty before performing various other teardown tasks in
> drm_sched_fini().
> 
> If there are still jobs in the pending_list, this is problematic because
> after scheduler teardown, no one will call backend_ops.free_job()
> anymore. This would, consequently, result in memory leaks.
> 
> One way to solve this is to implement a waitqueue that drm_sched_fini()
> blocks on until the pending_list has become empty. That waitqueue must
> obviously not block for a significant time. Thus, it's necessary to only
> wait if it's guaranteed that all fences will get signaled quickly.
> 
> This can be ensured by having the driver implement a new backend ops,
> cancel_pending_fences(), in which the driver shall signal all
> unsignaled, in-flight fences with an error.
> 
> Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
> pending_list becomes empty. Wait in drm_sched_fini() for that to happen
> if cancel_pending_fences() is implemented.
> 
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 105 ++++++++++++++++++++-----
>   include/drm/gpu_scheduler.h            |  19 +++++
>   2 files changed, 105 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index f7118497e47a..406572f5168e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -367,7 +367,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>    */
>   static void __drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>   {
> -	if (!READ_ONCE(sched->pause_submit))
> +	if (!READ_ONCE(sched->pause_free))
>   		queue_work(sched->submit_wq, &sched->work_free_job);
>   }
>   
> @@ -1121,6 +1121,12 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
>   		/* remove job from pending_list */
>   		list_del_init(&job->list);
>   
> +		/*
> +		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> +		 */
> +		if (list_empty(&sched->pending_list))
> +			wake_up(&sched->pending_list_waitque);

Wait what? ;) (pun intended)

I think I mentioned in the last round that waitque looks dodgy. Either a 
typo or a very unusual and novel shorthand? I suggest a typical wq or 
waitqueue.

I also mentioned that one more advantage of the ->cancel_job() approach 
is that there is no need for these extra calls on the normal 
(non-teardown) path at all.

> +
>   		/* cancel this job's TO timer */
>   		cancel_delayed_work(&sched->work_tdr);
>   		/* make the scheduled timestamp more accurate */
> @@ -1326,6 +1332,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>   	init_waitqueue_head(&sched->job_scheduled);
>   	INIT_LIST_HEAD(&sched->pending_list);
>   	spin_lock_init(&sched->job_list_lock);
> +	init_waitqueue_head(&sched->pending_list_waitque);
>   	atomic_set(&sched->credit_count, 0);
>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>   	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> @@ -1333,6 +1340,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>   	atomic_set(&sched->_score, 0);
>   	atomic64_set(&sched->job_id_count, 0);
>   	sched->pause_submit = false;
> +	sched->pause_free = false;
>   
>   	sched->ready = true;
>   	return 0;
> @@ -1350,33 +1358,90 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>   }
>   EXPORT_SYMBOL(drm_sched_init);
>   
> +/**
> + * drm_sched_submission_and_timeout_stop - stop everything except for free_job
> + * @sched: scheduler instance
> + *
> + * Helper for tearing down the scheduler in drm_sched_fini().
> + */
> +static void
> +drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler *sched)
> +{
> +	WRITE_ONCE(sched->pause_submit, true);
> +	cancel_work_sync(&sched->work_run_job);
> +	cancel_delayed_work_sync(&sched->work_tdr);
> +}
> +
> +/**
> + * drm_sched_free_stop - stop free_job
> + * @sched: scheduler instance
> + *
> + * Helper for tearing down the scheduler in drm_sched_fini().
> + */
> +static void drm_sched_free_stop(struct drm_gpu_scheduler *sched)
> +{
> +	WRITE_ONCE(sched->pause_free, true);
> +	cancel_work_sync(&sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_no_jobs_pending - check whether jobs are pending
> + * @sched: scheduler instance
> + *
> + * Checks if jobs are pending for @sched.
> + *
> + * Return: true if jobs are pending, false otherwise.
> + */
> +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> +{
> +	bool empty;
> +
> +	spin_lock(&sched->job_list_lock);
> +	empty = list_empty(&sched->pending_list);
> +	spin_unlock(&sched->job_list_lock);
> +
> +	return empty;
> +}
> +
> +/**
> + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
> + * @sched: scheduler instance
> + *
> + * Must only be called if &struct drm_sched_backend_ops.cancel_pending_fences is
> + * implemented.
> + *
> + * Instructs the driver to kill the fence context associated with this scheduler,
> + * thereby signaling all pending fences. This, in turn, will trigger
> + * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
> + * The function then blocks until all pending jobs have been freed.
> + */
> +static void drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
> +{
> +	sched->ops->cancel_pending_fences(sched);
> +	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
> +}
> +
>   /**
>    * drm_sched_fini - Destroy a gpu scheduler
>    *
>    * @sched: scheduler instance
>    *
> - * Tears down and cleans up the scheduler.
> - *
> - * This stops submission of new jobs to the hardware through
> - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
> - * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
> - * There is no solution for this currently. Thus, it is up to the driver to make
> - * sure that:
> - *
> - *  a) drm_sched_fini() is only called after for all submitted jobs
> - *     drm_sched_backend_ops.free_job() has been called or that
> - *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
> - *     after drm_sched_fini() ran are freed manually.
> - *
> - * FIXME: Take care of the above problem and prevent this function from leaking
> - * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> + * Tears down and cleans up the scheduler. Might leak memory if
> + * &struct drm_sched_backend_ops.cancel_pending_fences is not implemented.
>    */
>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>   {
>   	struct drm_sched_entity *s_entity;
>   	int i;
>   
> -	drm_sched_wqueue_stop(sched);
> +	if (sched->ops->cancel_pending_fences) {
> +		drm_sched_submission_and_timeout_stop(sched);
> +		drm_sched_cancel_jobs_and_wait(sched);
> +		drm_sched_free_stop(sched);
> +	} else {
> +		/* We're in "legacy free-mode" and ignore potential mem leaks */
> +		drm_sched_wqueue_stop(sched);
> +	}
>   
>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>   		struct drm_sched_rq *rq = sched->sched_rq[i];
> @@ -1464,7 +1529,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
>   EXPORT_SYMBOL(drm_sched_wqueue_ready);
>   
>   /**
> - * drm_sched_wqueue_stop - stop scheduler submission
> + * drm_sched_wqueue_stop - stop scheduler submission and freeing
>    * @sched: scheduler instance
>    *
>    * Stops the scheduler from pulling new jobs from entities. It also stops
> @@ -1473,13 +1538,14 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
>   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
>   {
>   	WRITE_ONCE(sched->pause_submit, true);
> +	WRITE_ONCE(sched->pause_free, true);
>   	cancel_work_sync(&sched->work_run_job);
>   	cancel_work_sync(&sched->work_free_job);
>   }
>   EXPORT_SYMBOL(drm_sched_wqueue_stop);
>   
>   /**
> - * drm_sched_wqueue_start - start scheduler submission
> + * drm_sched_wqueue_start - start scheduler submission and freeing
>    * @sched: scheduler instance
>    *
>    * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
> @@ -1490,6 +1556,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
>   {
>   	WRITE_ONCE(sched->pause_submit, false);
> +	WRITE_ONCE(sched->pause_free, false);
>   	queue_work(sched->submit_wq, &sched->work_run_job);
>   	queue_work(sched->submit_wq, &sched->work_free_job);
>   }
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index d860db087ea5..d8bd5b605336 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -29,6 +29,7 @@
>   #include <linux/completion.h>
>   #include <linux/xarray.h>
>   #include <linux/workqueue.h>
> +#include <linux/wait.h>
>   
>   #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
>   
> @@ -508,6 +509,19 @@ struct drm_sched_backend_ops {
>            * and it's time to clean it up.
>   	 */
>   	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @cancel_pending_fences: cancel all unsignaled hardware fences
> +	 *
> +	 * This callback must signal all unsignaled hardware fences associated
> +	 * with @sched with an appropriate error code (e.g., -ECANCELED). This
> +	 * ensures that all jobs will get freed by the scheduler before
> +	 * teardown.
> +	 *
> +	 * This callback is optional, but it is highly recommended to implement
> +	 * it to avoid memory leaks.
> +	 */
> +	void (*cancel_pending_fences)(struct drm_gpu_scheduler *sched);

I still don't understand why you insist on using a new term in the 
backend ops, and even in the whole scheduler API. Nothing in the API so 
far has fences in the name. Something like cancel(_all|pending)_jobs or 
sched_fini would read as more aligned with the rest to me.

>   };
>   
>   /**
> @@ -533,6 +547,8 @@ struct drm_sched_backend_ops {
>    *            timeout interval is over.
>    * @pending_list: the list of jobs which are currently in the job queue.
>    * @job_list_lock: lock to protect the pending_list.
> + * @pending_list_waitque: a waitqueue for drm_sched_fini() to block on until all
> + *		          pending jobs have been finished.
>    * @hang_limit: once the hangs by a job crosses this limit then it is marked
>    *              guilty and it will no longer be considered for scheduling.
>    * @score: score to help loadbalancer pick a idle sched
> @@ -540,6 +556,7 @@ struct drm_sched_backend_ops {
>    * @ready: marks if the underlying HW is ready to work
>    * @free_guilty: A hit to time out handler to free the guilty job.
>    * @pause_submit: pause queuing of @work_run_job on @submit_wq
> + * @pause_free: pause queueing of @work_free_job on @submit_wq
>    * @own_submit_wq: scheduler owns allocation of @submit_wq
>    * @dev: system &struct device
>    *
> @@ -562,12 +579,14 @@ struct drm_gpu_scheduler {
>   	struct delayed_work		work_tdr;
>   	struct list_head		pending_list;
>   	spinlock_t			job_list_lock;
> +	wait_queue_head_t		pending_list_waitque;
>   	int				hang_limit;
>   	atomic_t                        *score;
>   	atomic_t                        _score;
>   	bool				ready;
>   	bool				free_guilty;
>   	bool				pause_submit;
> +	bool				pause_free;
>   	bool				own_submit_wq;
>   	struct device			*dev;
>   };

And, as you know, another thing I don't understand is why we would 
choose to add more to the state machine when I have shown how it can be 
done more elegantly. You don't have to reply, this is more for the 
record against v3.

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
  2025-05-22  8:27 ` [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method Philipp Stanner
@ 2025-05-22 14:06   ` Tvrtko Ursulin
  2025-05-22 14:59     ` Philipp Stanner
  0 siblings, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2025-05-22 14:06 UTC (permalink / raw)
  To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel


On 22/05/2025 09:27, Philipp Stanner wrote:
> The drm_gpu_scheduler now supports a callback to help drm_sched_fini()
> avoid memory leaks. This callback instructs the driver to signal all
> pending hardware fences.
> 
> Implement the new callback
> drm_sched_backend_ops.cancel_pending_fences().
> 
> Have the callback use drm_mock_sched_job_complete() with a new error
> field for the fence error.
> 
> Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there is
> no party for which checking for a CANCELED status would be useful
> currently.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++------------
>   drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
>   2 files changed, 25 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..eca47f0395bc 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct drm_mock_sched_entity *entity)
>   	drm_sched_entity_destroy(&entity->base);
>   }
>   
> -static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> +static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job, int err)
>   {
>   	struct drm_mock_scheduler *sched =
>   		drm_sched_to_mock_sched(job->base.sched);
> @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
>   	lockdep_assert_held(&sched->lock);
>   
>   	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> -	list_move_tail(&job->link, &sched->done_list);
> +	list_del(&job->link);
> +	if (err)
> +		dma_fence_set_error(&job->hw_fence, err);
>   	dma_fence_signal(&job->hw_fence);
>   	complete(&job->done);
>   }
> @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer *hrtimer)
>   			break;
>   
>   		sched->hw_timeline.cur_seqno = job->hw_fence.seqno;
> -		drm_mock_sched_job_complete(job);
> +		drm_mock_sched_job_complete(job, 0);
>   	}
>   	spin_unlock_irqrestore(&sched->lock, flags);
>   
> @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
>   
>   static void mock_sched_free_job(struct drm_sched_job *sched_job)
>   {
> -	struct drm_mock_scheduler *sched =
> -			drm_sched_to_mock_sched(sched_job->sched);
>   	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> -	unsigned long flags;
>   
> -	/* Remove from the scheduler done list. */
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_del(&job->link);
> -	spin_unlock_irqrestore(&sched->lock, flags);
>   	dma_fence_put(&job->hw_fence);
> -
>   	drm_sched_job_cleanup(sched_job);
>   
>   	/* Mock job itself is freed by the kunit framework. */
>   }
>   
> +static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *gsched)

"gsched" feels like a first time invention. Maybe drm_sched?

> +{
> +	struct drm_mock_sched_job *job, *next;
> +	struct drm_mock_scheduler *sched;
> +	unsigned long flags;
> +
> +	sched = container_of(gsched, struct drm_mock_scheduler, base);
> +
> +	spin_lock_irqsave(&sched->lock, flags);
> +	list_for_each_entry_safe(job, next, &sched->job_list, link)
> +		drm_mock_sched_job_complete(job, -ECANCELED);
> +	spin_unlock_irqrestore(&sched->lock, flags);

Canceling the timers belongs in this callback, I think. Otherwise the 
jobs are not fully cancelled.

Hm, I also think, conceptually, the order of first canceling the timer 
and then signaling the fence should be kept.

At the moment it does not matter hugely, since the timer does not signal 
the jobs directly and will not find unlinked jobs, but if that changes 
in the future, the reversed order could cause double signaling. So if 
you keep it in the correct logical order that potential gotcha is 
avoided. Basically just keep the two pass approach verbatim, as is in 
the current drm_mock_sched_fini.
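
In other words, roughly (untested):

static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *gsched)
{
	struct drm_mock_scheduler *sched =
		container_of(gsched, struct drm_mock_scheduler, base);
	struct drm_mock_sched_job *job, *next;
	unsigned long flags;
	LIST_HEAD(list);

	/* First pass: take the jobs off the scheduler's job list... */
	spin_lock_irqsave(&sched->lock, flags);
	list_for_each_entry_safe(job, next, &sched->job_list, link)
		list_move_tail(&job->link, &list);
	spin_unlock_irqrestore(&sched->lock, flags);

	/* ...cancel their timers with the lock dropped... */
	list_for_each_entry(job, &list, link)
		hrtimer_cancel(&job->timer);

	/* ...and only then signal them as cancelled. */
	spin_lock_irqsave(&sched->lock, flags);
	list_for_each_entry_safe(job, next, &list, link)
		drm_mock_sched_job_complete(job, -ECANCELED);
	spin_unlock_irqrestore(&sched->lock, flags);
}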

The rest of the conversion is I think good.

Only a slight uncertainty remains after I cross-referenced with my 
version (->cancel_job()) about why I needed to add signaling to 
mock_sched_timedout_job() and manual job cleanup to the timeout test. It 
was more than a month ago that I wrote it, so I can't remember right now. 
You checked for memory leaks and the usual stuff?

Regards,

Tvrtko

> +}
> +
>   static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
>   	.run_job = mock_sched_run_job,
>   	.timedout_job = mock_sched_timedout_job,
> -	.free_job = mock_sched_free_job
> +	.free_job = mock_sched_free_job,
> +	.cancel_pending_fences = mock_sched_cancel_pending_fences,
>   };
>   
>   /**
> @@ -265,7 +274,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
>   	sched->hw_timeline.context = dma_fence_context_alloc(1);
>   	atomic_set(&sched->hw_timeline.next_seqno, 0);
>   	INIT_LIST_HEAD(&sched->job_list);
> -	INIT_LIST_HEAD(&sched->done_list);
>   	spin_lock_init(&sched->lock);
>   
>   	return sched;
> @@ -280,38 +288,11 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
>    */
>   void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>   {
> -	struct drm_mock_sched_job *job, *next;
> -	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct drm_mock_sched_job *job;
>   
> -	drm_sched_wqueue_stop(&sched->base);
> -
> -	/* Force complete all unfinished jobs. */
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_for_each_entry_safe(job, next, &sched->job_list, link)
> -		list_move_tail(&job->link, &list);
> -	spin_unlock_irqrestore(&sched->lock, flags);
> -
> -	list_for_each_entry(job, &list, link)
> +	list_for_each_entry(job, &sched->job_list, link)
>   		hrtimer_cancel(&job->timer);
>   
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_for_each_entry_safe(job, next, &list, link)
> -		drm_mock_sched_job_complete(job);
> -	spin_unlock_irqrestore(&sched->lock, flags);
> -
> -	/*
> -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
> -	 * free worker.
> -	 */
> -	spin_lock_irqsave(&sched->lock, flags);
> -	list_for_each_entry_safe(job, next, &sched->done_list, link)
> -		list_move_tail(&job->link, &list);
> -	spin_unlock_irqrestore(&sched->lock, flags);
> -
> -	list_for_each_entry_safe(job, next, &list, link)
> -		mock_sched_free_job(&job->base);
> -
>   	drm_sched_fini(&sched->base);
>   }
>   
> @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
>   		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
>   			break;
>   
> -		drm_mock_sched_job_complete(job);
> +		drm_mock_sched_job_complete(job, 0);
>   		found++;
>   	}
>   unlock:
> diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> index 27caf8285fb7..22e530d87791 100644
> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> @@ -32,9 +32,8 @@
>    *
>    * @base: DRM scheduler base class
>    * @test: Backpointer to owning the kunit test case
> - * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
> + * @lock: Lock to protect the simulated @hw_timeline and @job_list
>    * @job_list: List of jobs submitted to the mock GPU
> - * @done_list: List of jobs completed by the mock GPU
>    * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
>    *		 @cur_seqno for implementing a struct dma_fence signaling the
>    *		 simulated job completion.
> @@ -49,7 +48,6 @@ struct drm_mock_scheduler {
>   
>   	spinlock_t		lock;
>   	struct list_head	job_list;
> -	struct list_head	done_list;
>   
>   	struct {
>   		u64		context;



* Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
  2025-05-22 14:06   ` Tvrtko Ursulin
@ 2025-05-22 14:59     ` Philipp Stanner
  2025-05-23 15:49       ` Tvrtko Ursulin
  0 siblings, 1 reply; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22 14:59 UTC (permalink / raw)
  To: Tvrtko Ursulin, Philipp Stanner, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel

On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote:
> 
> On 22/05/2025 09:27, Philipp Stanner wrote:
> > The drm_gpu_scheduler now supports a callback to help
> > drm_sched_fini()
> > avoid memory leaks. This callback instructs the driver to signal
> > all
> > pending hardware fences.
> > 
> > Implement the new callback
> > drm_sched_backend_ops.cancel_pending_fences().
> > 
> > Have the callback use drm_mock_sched_job_complete() with a new
> > error
> > field for the fence error.
> > 
> > Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there
> > is
> > no party for which checking for a CANCELED status would be useful
> > currently.
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++-------
> > -----
> >   drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
> >   2 files changed, 25 insertions(+), 46 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > index f999c8859cf7..eca47f0395bc 100644
> > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct
> > drm_mock_sched_entity *entity)
> >   	drm_sched_entity_destroy(&entity->base);
> >   }
> >   
> > -static void drm_mock_sched_job_complete(struct drm_mock_sched_job
> > *job)
> > +static void drm_mock_sched_job_complete(struct drm_mock_sched_job
> > *job, int err)
> >   {
> >   	struct drm_mock_scheduler *sched =
> >   		drm_sched_to_mock_sched(job->base.sched);
> > @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct
> > drm_mock_sched_job *job)
> >   	lockdep_assert_held(&sched->lock);
> >   
> >   	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> > -	list_move_tail(&job->link, &sched->done_list);
> > +	list_del(&job->link);
> > +	if (err)
> > +		dma_fence_set_error(&job->hw_fence, err);
> >   	dma_fence_signal(&job->hw_fence);
> >   	complete(&job->done);
> >   }
> > @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer
> > *hrtimer)
> >   			break;
> >   
> >   		sched->hw_timeline.cur_seqno = job-
> > >hw_fence.seqno;
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >   	}
> >   	spin_unlock_irqrestore(&sched->lock, flags);
> >   
> > @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job
> > *sched_job)
> >   
> >   static void mock_sched_free_job(struct drm_sched_job *sched_job)
> >   {
> > -	struct drm_mock_scheduler *sched =
> > -			drm_sched_to_mock_sched(sched_job->sched);
> >   	struct drm_mock_sched_job *job =
> > drm_sched_job_to_mock_job(sched_job);
> > -	unsigned long flags;
> >   
> > -	/* Remove from the scheduler done list. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_del(&job->link);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> >   	dma_fence_put(&job->hw_fence);
> > -
> >   	drm_sched_job_cleanup(sched_job);
> >   
> >   	/* Mock job itself is freed by the kunit framework. */
> >   }
> >   
> > +static void mock_sched_cancel_pending_fences(struct
> > drm_gpu_scheduler *gsched)
> 
> "gsched" feels like a first time invention. Maybe drm_sched?

Alright

> 
> > +{
> > +	struct drm_mock_sched_job *job, *next;
> > +	struct drm_mock_scheduler *sched;
> > +	unsigned long flags;
> > +
> > +	sched = container_of(gsched, struct drm_mock_scheduler,
> > base);
> > +
> > +	spin_lock_irqsave(&sched->lock, flags);
> > +	list_for_each_entry_safe(job, next, &sched->job_list,
> > link)
> > +		drm_mock_sched_job_complete(job, -ECANCELED);
> > +	spin_unlock_irqrestore(&sched->lock, flags);
> 
> Canceling of the timers belongs in this callback, I think. Otherwise 
> jobs are not fully cancelled.

I wouldn't say so – the timers represent things like the hardware
interrupts. And those must be deactivated by the driver itself.

See, one big reason why I like my approach is that the contract between
driver and scheduler is made very simple:

"Driver, signal all fences that you ever returned through run_job() to
this scheduler!"

That always works, and the driver always has all those fences. It's
based on the most fundamental agreement regarding dma_fences: they must
all be signaled.
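
Roughly speaking, a driver fulfilling that contract needs little more than
a walk over its in-flight fences. A quick, untested sketch; the my_* names
are made up, and it assumes the driver keeps its in-flight jobs on a list
protected by the same lock it uses as the dma_fence lock:

static void my_cancel_pending_fences(struct drm_gpu_scheduler *sched)
{
	struct my_sched *ms = container_of(sched, struct my_sched, base);
	struct my_job *job, *tmp;
	unsigned long flags;

	spin_lock_irqsave(&ms->lock, flags);
	list_for_each_entry_safe(job, tmp, &ms->in_flight, link) {
		/* Give waiters a meaningful error before signaling. */
		if (!dma_fence_is_signaled_locked(&job->hw_fence))
			dma_fence_set_error(&job->hw_fence, -ECANCELED);
		dma_fence_signal_locked(&job->hw_fence);
		list_del_init(&job->link);
	}
	spin_unlock_irqrestore(&ms->lock, flags);
}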

> 
> Hm, I also think, conceptually, the order of first canceling the
> timer 
> and then signaling the fence should be kept.

That's the case here, no?

It must indeed be kept, otherwise the timers could fire after
everything is torn down -> UAF

> 
> At the moment it does not matter hugely, since the timer does not
> signal 
> the jobs directly and will not find unlinked jobs, but if that
> changes 
> in the future, the reversed order could cause double signaling. So if
> you keep it in the correct logical order that potential gotcha is 
> avoided. Basically just keep the two pass approach verbatim, as is in
> the current drm_mock_sched_fini.
> 
> The rest of the conversion is I think good.

:)

> 
> Only a slight uncertainty after I cross-referenced with my version 
> (->cancel_job()) around why I needed to add signaling to 
> mock_sched_timedout_job() and manual job cleanup to the timeout test.
> It 
> was more than a month ago that I wrote it so can't remember right
> now. 
> You checked for memory leaks and the usual stuff?

Hm, it seems I indeed ran into that leak that you fixed (in addition to
the other stuff) in your RFC, for the timeout tests.

We should fix that in a separate patch, probably.


P.

> 
> Regards,
> 
> Tvrtko
> 
> > +}
> > +
> >   static const struct drm_sched_backend_ops drm_mock_scheduler_ops
> > = {
> >   	.run_job = mock_sched_run_job,
> >   	.timedout_job = mock_sched_timedout_job,
> > -	.free_job = mock_sched_free_job
> > +	.free_job = mock_sched_free_job,
> > +	.cancel_pending_fences = mock_sched_cancel_pending_fences,
> >   };
> >   
> >   /**
> > @@ -265,7 +274,6 @@ struct drm_mock_scheduler
> > *drm_mock_sched_new(struct kunit *test, long timeout)
> >   	sched->hw_timeline.context = dma_fence_context_alloc(1);
> >   	atomic_set(&sched->hw_timeline.next_seqno, 0);
> >   	INIT_LIST_HEAD(&sched->job_list);
> > -	INIT_LIST_HEAD(&sched->done_list);
> >   	spin_lock_init(&sched->lock);
> >   
> >   	return sched;
> > @@ -280,38 +288,11 @@ struct drm_mock_scheduler
> > *drm_mock_sched_new(struct kunit *test, long timeout)
> >    */
> >   void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> >   {
> > -	struct drm_mock_sched_job *job, *next;
> > -	unsigned long flags;
> > -	LIST_HEAD(list);
> > +	struct drm_mock_sched_job *job;
> >   
> > -	drm_sched_wqueue_stop(&sched->base);
> > -
> > -	/* Force complete all unfinished jobs. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->job_list,
> > link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry(job, &list, link)
> > +	list_for_each_entry(job, &sched->job_list, link)
> >   		hrtimer_cancel(&job->timer);
> >   
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		drm_mock_sched_job_complete(job);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	/*
> > -	 * Free completed jobs and jobs not yet processed by the
> > DRM scheduler
> > -	 * free worker.
> > -	 */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->done_list,
> > link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		mock_sched_free_job(&job->base);
> > -
> >   	drm_sched_fini(&sched->base);
> >   }
> >   
> > @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct
> > drm_mock_scheduler *sched,
> >   		if (sched->hw_timeline.cur_seqno < job-
> > >hw_fence.seqno)
> >   			break;
> >   
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >   		found++;
> >   	}
> >   unlock:
> > diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > index 27caf8285fb7..22e530d87791 100644
> > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > @@ -32,9 +32,8 @@
> >    *
> >    * @base: DRM scheduler base class
> >    * @test: Backpointer to owning the kunit test case
> > - * @lock: Lock to protect the simulated @hw_timeline, @job_list
> > and @done_list
> > + * @lock: Lock to protect the simulated @hw_timeline and @job_list
> >    * @job_list: List of jobs submitted to the mock GPU
> > - * @done_list: List of jobs completed by the mock GPU
> >    * @hw_timeline: Simulated hardware timeline has a @context,
> > @next_seqno and
> >    *		 @cur_seqno for implementing a struct dma_fence
> > signaling the
> >    *		 simulated job completion.
> > @@ -49,7 +48,6 @@ struct drm_mock_scheduler {
> >   
> >   	spinlock_t		lock;
> >   	struct list_head	job_list;
> > -	struct list_head	done_list;
> >   
> >   	struct {
> >   		u64		context;
> 



* Re: [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-05-22 13:37   ` Tvrtko Ursulin
@ 2025-05-22 15:32     ` Philipp Stanner
  2025-05-23 15:35       ` Tvrtko Ursulin
  0 siblings, 1 reply; 13+ messages in thread
From: Philipp Stanner @ 2025-05-22 15:32 UTC (permalink / raw)
  To: Tvrtko Ursulin, Philipp Stanner, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel

On Thu, 2025-05-22 at 14:37 +0100, Tvrtko Ursulin wrote:
> 
> On 22/05/2025 09:27, Philipp Stanner wrote:
> > From: Philipp Stanner <pstanner@redhat.com>
> > 
> > The GPU scheduler currently does not ensure that its pending_list
> > is
> > empty before performing various other teardown tasks in
> > drm_sched_fini().
> > 
> > If there are still jobs in the pending_list, this is problematic
> > because
> > after scheduler teardown, no one will call backend_ops.free_job()
> > anymore. This would, consequently, result in memory leaks.
> > 
> > One way to solve this is to implement a waitqueue that
> > drm_sched_fini()
> > blocks on until the pending_list has become empty. That waitqueue
> > must
> > obviously not block for a significant time. Thus, it's necessary to
> > only
> > wait if it's guaranteed that all fences will get signaled quickly.
> > 
> > This can be ensured by having the driver implement a new backend
> > ops,
> > cancel_pending_fences(), in which the driver shall signal all
> > unsignaled, in-flight fences with an error.
> > 
> > Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once
> > the
> > pending_list becomes empty. Wait in drm_sched_fini() for that to
> > happen
> > if cancel_pending_fences() is implemented.
> > 
> > Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 105
> > ++++++++++++++++++++-----
> >   include/drm/gpu_scheduler.h            |  19 +++++
> >   2 files changed, 105 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index f7118497e47a..406572f5168e 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -367,7 +367,7 @@ static void drm_sched_run_job_queue(struct
> > drm_gpu_scheduler *sched)
> >    */
> >   static void __drm_sched_run_free_queue(struct drm_gpu_scheduler
> > *sched)
> >   {
> > -	if (!READ_ONCE(sched->pause_submit))
> > +	if (!READ_ONCE(sched->pause_free))
> >   		queue_work(sched->submit_wq, &sched-
> > >work_free_job);
> >   }
> >   
> > @@ -1121,6 +1121,12 @@ drm_sched_get_finished_job(struct
> > drm_gpu_scheduler *sched)
> >   		/* remove job from pending_list */
> >   		list_del_init(&job->list);
> >   
> > +		/*
> > +		 * Inform tasks blocking in drm_sched_fini() that
> > it's now safe to proceed.
> > +		 */
> > +		if (list_empty(&sched->pending_list))
> > +			wake_up(&sched->pending_list_waitque);
> 
> Wait what? ;) (pun intended)
> 
> I think I mentioned in the last round that waitque looks dodgy.
> Either a 
> typo or a very unusual and novel shorthand? I suggest a typical wq or
> waitqueue.

Ah right, I forgot about that.

> 
> I also mentioned that one more advantage of the ->cancel_job()
> approach 
> is there is no need for these extra calls on the normal path (non 
> teardown) at all.

Yes, agreed, that's a tiny performance gain. But it's running in a
workqueue, so no big deal. But see below, too.

> 
> > +
> >   		/* cancel this job's TO timer */
> >   		cancel_delayed_work(&sched->work_tdr);
> >   		/* make the scheduled timestamp more accurate */
> > @@ -1326,6 +1332,7 @@ int drm_sched_init(struct drm_gpu_scheduler
> > *sched, const struct drm_sched_init_
> >   	init_waitqueue_head(&sched->job_scheduled);
> >   	INIT_LIST_HEAD(&sched->pending_list);
> >   	spin_lock_init(&sched->job_list_lock);
> > +	init_waitqueue_head(&sched->pending_list_waitque);
> >   	atomic_set(&sched->credit_count, 0);
> >   	INIT_DELAYED_WORK(&sched->work_tdr,
> > drm_sched_job_timedout);
> >   	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > @@ -1333,6 +1340,7 @@ int drm_sched_init(struct drm_gpu_scheduler
> > *sched, const struct drm_sched_init_
> >   	atomic_set(&sched->_score, 0);
> >   	atomic64_set(&sched->job_id_count, 0);
> >   	sched->pause_submit = false;
> > +	sched->pause_free = false;
> >   
> >   	sched->ready = true;
> >   	return 0;
> > @@ -1350,33 +1358,90 @@ int drm_sched_init(struct drm_gpu_scheduler
> > *sched, const struct drm_sched_init_
> >   }
> >   EXPORT_SYMBOL(drm_sched_init);
> >   
> > +/**
> > + * drm_sched_submission_and_timeout_stop - stop everything except
> > for free_job
> > + * @sched: scheduler instance
> > + *
> > + * Helper for tearing down the scheduler in drm_sched_fini().
> > + */
> > +static void
> > +drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler
> > *sched)
> > +{
> > +	WRITE_ONCE(sched->pause_submit, true);
> > +	cancel_work_sync(&sched->work_run_job);
> > +	cancel_delayed_work_sync(&sched->work_tdr);
> > +}
> > +
> > +/**
> > + * drm_sched_free_stop - stop free_job
> > + * @sched: scheduler instance
> > + *
> > + * Helper for tearing down the scheduler in drm_sched_fini().
> > + */
> > +static void drm_sched_free_stop(struct drm_gpu_scheduler *sched)
> > +{
> > +	WRITE_ONCE(sched->pause_free, true);
> > +	cancel_work_sync(&sched->work_free_job);
> > +}
> > +
> > +/**
> > + * drm_sched_no_jobs_pending - check whether jobs are pending
> > + * @sched: scheduler instance
> > + *
> > + * Checks if jobs are pending for @sched.
> > + *
> > + * Return: true if jobs are pending, false otherwise.
> > + */
> > +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler
> > *sched)
> > +{
> > +	bool empty;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	empty = list_empty(&sched->pending_list);
> > +	spin_unlock(&sched->job_list_lock);
> > +
> > +	return empty;
> > +}
> > +
> > +/**
> > + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending
> > jobs
> > + * @sched: scheduler instance
> > + *
> > + * Must only be called if &struct
> > drm_sched_backend_ops.cancel_pending_fences is
> > + * implemented.
> > + *
> > + * Instructs the driver to kill the fence context associated with
> > this scheduler,
> > + * thereby signaling all pending fences. This, in turn, will
> > trigger
> > + * &struct drm_sched_backend_ops.free_job to be called for all
> > pending jobs.
> > + * The function then blocks until all pending jobs have been
> > freed.
> > + */
> > +static void drm_sched_cancel_jobs_and_wait(struct
> > drm_gpu_scheduler *sched)
> > +{
> > +	sched->ops->cancel_pending_fences(sched);
> > +	wait_event(sched->pending_list_waitque,
> > drm_sched_no_jobs_pending(sched));
> > +}
> > +
> >   /**
> >    * drm_sched_fini - Destroy a gpu scheduler
> >    *
> >    * @sched: scheduler instance
> >    *
> > - * Tears down and cleans up the scheduler.
> > - *
> > - * This stops submission of new jobs to the hardware through
> > - * drm_sched_backend_ops.run_job(). Consequently,
> > drm_sched_backend_ops.free_job()
> > - * will not be called for all jobs still in
> > drm_gpu_scheduler.pending_list.
> > - * There is no solution for this currently. Thus, it is up to the
> > driver to make
> > - * sure that:
> > - *
> > - *  a) drm_sched_fini() is only called after for all submitted
> > jobs
> > - *     drm_sched_backend_ops.free_job() has been called or that
> > - *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> > been called
> > - *     after drm_sched_fini() ran are freed manually.
> > - *
> > - * FIXME: Take care of the above problem and prevent this function
> > from leaking
> > - * the jobs in drm_gpu_scheduler.pending_list under any
> > circumstances.
> > + * Tears down and cleans up the scheduler. Might leak memory if
> > + * &struct drm_sched_backend_ops.cancel_pending_fences is not
> > implemented.
> >    */
> >   void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >   {
> >   	struct drm_sched_entity *s_entity;
> >   	int i;
> >   
> > -	drm_sched_wqueue_stop(sched);
> > +	if (sched->ops->cancel_pending_fences) {
> > +		drm_sched_submission_and_timeout_stop(sched);
> > +		drm_sched_cancel_jobs_and_wait(sched);
> > +		drm_sched_free_stop(sched);
> > +	} else {
> > +		/* We're in "legacy free-mode" and ignore
> > potential mem leaks */
> > +		drm_sched_wqueue_stop(sched);
> > +	}
> >   
> >   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > i++) {
> >   		struct drm_sched_rq *rq = sched->sched_rq[i];
> > @@ -1464,7 +1529,7 @@ bool drm_sched_wqueue_ready(struct
> > drm_gpu_scheduler *sched)
> >   EXPORT_SYMBOL(drm_sched_wqueue_ready);
> >   
> >   /**
> > - * drm_sched_wqueue_stop - stop scheduler submission
> > + * drm_sched_wqueue_stop - stop scheduler submission and freeing
> >    * @sched: scheduler instance
> >    *
> >    * Stops the scheduler from pulling new jobs from entities. It
> > also stops
> > @@ -1473,13 +1538,14 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
> >   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
> >   {
> >   	WRITE_ONCE(sched->pause_submit, true);
> > +	WRITE_ONCE(sched->pause_free, true);
> >   	cancel_work_sync(&sched->work_run_job);
> >   	cancel_work_sync(&sched->work_free_job);
> >   }
> >   EXPORT_SYMBOL(drm_sched_wqueue_stop);
> >   
> >   /**
> > - * drm_sched_wqueue_start - start scheduler submission
> > + * drm_sched_wqueue_start - start scheduler submission and freeing
> >    * @sched: scheduler instance
> >    *
> >    * Restarts the scheduler after drm_sched_wqueue_stop() has
> > stopped it.
> > @@ -1490,6 +1556,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
> >   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
> >   {
> >   	WRITE_ONCE(sched->pause_submit, false);
> > +	WRITE_ONCE(sched->pause_free, false);
> >   	queue_work(sched->submit_wq, &sched->work_run_job);
> >   	queue_work(sched->submit_wq, &sched->work_free_job);
> >   }
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index d860db087ea5..d8bd5b605336 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -29,6 +29,7 @@
> >   #include <linux/completion.h>
> >   #include <linux/xarray.h>
> >   #include <linux/workqueue.h>
> > +#include <linux/wait.h>
> >   
> >   #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
> >   
> > @@ -508,6 +509,19 @@ struct drm_sched_backend_ops {
> >            * and it's time to clean it up.
> >   	 */
> >   	void (*free_job)(struct drm_sched_job *sched_job);
> > +
> > +	/**
> > +	 * @cancel_pending_fences: cancel all unsignaled hardware
> > fences
> > +	 *
> > +	 * This callback must signal all unsignaled hardware
> > fences associated
> > +	 * with @sched with an appropriate error code (e.g., -
> > ECANCELED). This
> > +	 * ensures that all jobs will get freed by the scheduler
> > before
> > +	 * teardown.
> > +	 *
> > +	 * This callback is optional, but it is highly recommended
> > to implement
> > +	 * it to avoid memory leaks.
> > +	 */
> > +	void (*cancel_pending_fences)(struct drm_gpu_scheduler
> > *sched);
> 
> I still don't understand why insist to use a new term in the backend 
> ops, and even the whole scheduler API. Nothing in the API so far has 
> fences in the name. Something like cancel(_all|pending)_jobs or 
> sched_fini would read more aligned with the rest to me.

Nothing has fences in the name, but they are the central concept of the
API: run_job() returns them, and they are the main mechanism through
which scheduler and driver communicate.

As mentioned in the other mail, the idea behind the callback is to get
all hardware fences signaled. Just that. That's a simple, well-established
concept, easy for drivers to understand. In contrast, it
would be far less clear what "cancel" even means. That's evident from
the other patch where we're suddenly wondering whether the driver
should also cancel timers etc. in the callback.

It should not. The contract is simple: "signal everything".

It's also more difficult to abuse, since signaling is always valid. But
when is canceling valid?

Considering how internal APIs have been abused in the past, I can very
well see some party using a cancel_job() callback in the future to
"cancel" single jobs, for example in our (still unsolved)
resubmit_jobs() problem.

> 
> >   };
> >   
> >   /**
> > @@ -533,6 +547,8 @@ struct drm_sched_backend_ops {
> >    *            timeout interval is over.
> >    * @pending_list: the list of jobs which are currently in the job
> > queue.
> >    * @job_list_lock: lock to protect the pending_list.
> > + * @pending_list_waitque: a waitqueue for drm_sched_fini() to
> > block on until all
> > + *		          pending jobs have been finished.
> >    * @hang_limit: once the hangs by a job crosses this limit then
> > it is marked
> >    *              guilty and it will no longer be considered for
> > scheduling.
> >    * @score: score to help loadbalancer pick a idle sched
> > @@ -540,6 +556,7 @@ struct drm_sched_backend_ops {
> >    * @ready: marks if the underlying HW is ready to work
> >    * @free_guilty: A hit to time out handler to free the guilty
> > job.
> >    * @pause_submit: pause queuing of @work_run_job on @submit_wq
> > + * @pause_free: pause queueing of @work_free_job on @submit_wq
> >    * @own_submit_wq: scheduler owns allocation of @submit_wq
> >    * @dev: system &struct device
> >    *
> > @@ -562,12 +579,14 @@ struct drm_gpu_scheduler {
> >   	struct delayed_work		work_tdr;
> >   	struct list_head		pending_list;
> >   	spinlock_t			job_list_lock;
> > +	wait_queue_head_t		pending_list_waitque;
> >   	int				hang_limit;
> >   	atomic_t                        *score;
> >   	atomic_t                        _score;
> >   	bool				ready;
> >   	bool				free_guilty;
> >   	bool				pause_submit;
> > +	bool				pause_free;
> >   	bool				own_submit_wq;
> >   	struct device			*dev;
> >   };
> 
> And, as you know, another thing I don't understand is why would we 
> choose to add more of the state machine when I have shown how it can
> be 
> done more elegantly. You don't have to reply, this is more a for the 
> record against v3.

I do like your approach to a degree, and I reimplemented and tested it
during the last days! Don't think I just easily toss aside a good idea;
in fact, weighing between both approaches did cause me some headache :)

The thing is that while implementing it for the unit tests (not even
starting on Nouveau, where the implementation is a bit more difficult
because some helpers need to be moved), I ran into a ton of faults
because of how the tests are constructed. When do I have to cancel
which timer for which job: all of them before calling drm_sched_fini(), or
each one separately in cancel_job()? What about timed-out jobs?

This [1], for example, is an implementation attempt I made which differs
only slightly from yours but does not work and causes all sorts of
issues with timer interrupts.

Now, that could obviously be fixed, and maybe I fail – but the thing
is, if I can fail, others can, too. And porting the unit tests is a
good beta-test.

When you look at the waitqueue solution, it is easy to implement for
both Nouveau and sched/tests, with minimal adjustments. My approach
works for both and is already well tested in Nouveau.

The ease with which it can be used and the simplicity of the contract
("signal all fences!") give me confidence that this is, despite the
larger state machine, more maintainable, and that it's easier to port
other drivers to the new cleanup method. I've had some success with
similar approaches when cleaning up 16-year-old broken PCI code [2].

Moreover, our long-term goal should anyway be to, ideally, port all
drivers to the new cleanup design and then make the callback mandatory.
Then some places in drm_sched_fini() can be cleaned up, too.

Besides, the above reasons, like the resistance against abuse of a
cancel_job(), also apply.


Does that sound convincing? :)
P.


[1] https://paste.debian.net/1376021/
[2] https://lore.kernel.org/linux-pci/20250520085938.GB261485@rocinante/T/#m6e28eabdb22286238545c1fa6026445a4001d8e2



> 
> Regards,
> 
> Tvrtko
> 



* Re: [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-05-22 15:32     ` Philipp Stanner
@ 2025-05-23 15:35       ` Tvrtko Ursulin
  0 siblings, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2025-05-23 15:35 UTC (permalink / raw)
  To: Philipp Stanner, Philipp Stanner, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel


On 22/05/2025 16:32, Philipp Stanner wrote:
> On Thu, 2025-05-22 at 14:37 +0100, Tvrtko Ursulin wrote:
>>
>> On 22/05/2025 09:27, Philipp Stanner wrote:
>>> From: Philipp Stanner <pstanner@redhat.com>
>>>
>>> The GPU scheduler currently does not ensure that its pending_list
>>> is
>>> empty before performing various other teardown tasks in
>>> drm_sched_fini().
>>>
>>> If there are still jobs in the pending_list, this is problematic
>>> because
>>> after scheduler teardown, no one will call backend_ops.free_job()
>>> anymore. This would, consequently, result in memory leaks.
>>>
>>> One way to solve this is to implement a waitqueue that
>>> drm_sched_fini()
>>> blocks on until the pending_list has become empty. That waitqueue
>>> must
>>> obviously not block for a significant time. Thus, it's necessary to
>>> only
>>> wait if it's guaranteed that all fences will get signaled quickly.
>>>
>>> This can be ensured by having the driver implement a new backend
>>> ops,
>>> cancel_pending_fences(), in which the driver shall signal all
>>> unsignaled, in-flight fences with an error.
>>>
>>> Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once
>>> the
>>> pending_list becomes empty. Wait in drm_sched_fini() for that to
>>> happen
>>> if cancel_pending_fences() is implemented.
>>>
>>> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 105
>>> ++++++++++++++++++++-----
>>>    include/drm/gpu_scheduler.h            |  19 +++++
>>>    2 files changed, 105 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index f7118497e47a..406572f5168e 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -367,7 +367,7 @@ static void drm_sched_run_job_queue(struct
>>> drm_gpu_scheduler *sched)
>>>     */
>>>    static void __drm_sched_run_free_queue(struct drm_gpu_scheduler
>>> *sched)
>>>    {
>>> -	if (!READ_ONCE(sched->pause_submit))
>>> +	if (!READ_ONCE(sched->pause_free))
>>>    		queue_work(sched->submit_wq, &sched-
>>>> work_free_job);
>>>    }
>>>    
>>> @@ -1121,6 +1121,12 @@ drm_sched_get_finished_job(struct
>>> drm_gpu_scheduler *sched)
>>>    		/* remove job from pending_list */
>>>    		list_del_init(&job->list);
>>>    
>>> +		/*
>>> +		 * Inform tasks blocking in drm_sched_fini() that
>>> it's now safe to proceed.
>>> +		 */
>>> +		if (list_empty(&sched->pending_list))
>>> +			wake_up(&sched->pending_list_waitque);
>>
>> Wait what? ;) (pun intended)
>>
>> I think I mentioned in the last round that waitque looks dodgy.
>> Either a
>> typo or a very unusual and novel shorthand? I suggest a typical wq or
>> waitqueue.
> 
> Ah right, I forgot about that.
> 
>>
>> I also mentioned that one more advantage of the ->cancel_job()
>> approach
>> is there is no need for these extra calls on the normal path (non
>> teardown) at all.
> 
> Yes, agreed, that's a tiny performance gain. But it's running in a
> workqueue, so no big deal. But see below, too
> 
>>
>>> +
>>>    		/* cancel this job's TO timer */
>>>    		cancel_delayed_work(&sched->work_tdr);
>>>    		/* make the scheduled timestamp more accurate */
>>> @@ -1326,6 +1332,7 @@ int drm_sched_init(struct drm_gpu_scheduler
>>> *sched, const struct drm_sched_init_
>>>    	init_waitqueue_head(&sched->job_scheduled);
>>>    	INIT_LIST_HEAD(&sched->pending_list);
>>>    	spin_lock_init(&sched->job_list_lock);
>>> +	init_waitqueue_head(&sched->pending_list_waitque);
>>>    	atomic_set(&sched->credit_count, 0);
>>>    	INIT_DELAYED_WORK(&sched->work_tdr,
>>> drm_sched_job_timedout);
>>>    	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>> @@ -1333,6 +1340,7 @@ int drm_sched_init(struct drm_gpu_scheduler
>>> *sched, const struct drm_sched_init_
>>>    	atomic_set(&sched->_score, 0);
>>>    	atomic64_set(&sched->job_id_count, 0);
>>>    	sched->pause_submit = false;
>>> +	sched->pause_free = false;
>>>    
>>>    	sched->ready = true;
>>>    	return 0;
>>> @@ -1350,33 +1358,90 @@ int drm_sched_init(struct drm_gpu_scheduler
>>> *sched, const struct drm_sched_init_
>>>    }
>>>    EXPORT_SYMBOL(drm_sched_init);
>>>    
>>> +/**
>>> + * drm_sched_submission_and_timeout_stop - stop everything except
>>> for free_job
>>> + * @sched: scheduler instance
>>> + *
>>> + * Helper for tearing down the scheduler in drm_sched_fini().
>>> + */
>>> +static void
>>> +drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler
>>> *sched)
>>> +{
>>> +	WRITE_ONCE(sched->pause_submit, true);
>>> +	cancel_work_sync(&sched->work_run_job);
>>> +	cancel_delayed_work_sync(&sched->work_tdr);
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_free_stop - stop free_job
>>> + * @sched: scheduler instance
>>> + *
>>> + * Helper for tearing down the scheduler in drm_sched_fini().
>>> + */
>>> +static void drm_sched_free_stop(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	WRITE_ONCE(sched->pause_free, true);
>>> +	cancel_work_sync(&sched->work_free_job);
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_no_jobs_pending - check whether jobs are pending
>>> + * @sched: scheduler instance
>>> + *
>>> + * Checks if jobs are pending for @sched.
>>> + *
>>> + * Return: true if jobs are pending, false otherwise.
>>> + */
>>> +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler
>>> *sched)
>>> +{
>>> +	bool empty;
>>> +
>>> +	spin_lock(&sched->job_list_lock);
>>> +	empty = list_empty(&sched->pending_list);
>>> +	spin_unlock(&sched->job_list_lock);
>>> +
>>> +	return empty;
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending
>>> jobs
>>> + * @sched: scheduler instance
>>> + *
>>> + * Must only be called if &struct
>>> drm_sched_backend_ops.cancel_pending_fences is
>>> + * implemented.
>>> + *
>>> + * Instructs the driver to kill the fence context associated with
>>> this scheduler,
>>> + * thereby signaling all pending fences. This, in turn, will
>>> trigger
>>> + * &struct drm_sched_backend_ops.free_job to be called for all
>>> pending jobs.
>>> + * The function then blocks until all pending jobs have been
>>> freed.
>>> + */
>>> +static void drm_sched_cancel_jobs_and_wait(struct
>>> drm_gpu_scheduler *sched)
>>> +{
>>> +	sched->ops->cancel_pending_fences(sched);
>>> +	wait_event(sched->pending_list_waitque,
>>> drm_sched_no_jobs_pending(sched));
>>> +}
>>> +
>>>    /**
>>>     * drm_sched_fini - Destroy a gpu scheduler
>>>     *
>>>     * @sched: scheduler instance
>>>     *
>>> - * Tears down and cleans up the scheduler.
>>> - *
>>> - * This stops submission of new jobs to the hardware through
>>> - * drm_sched_backend_ops.run_job(). Consequently,
>>> drm_sched_backend_ops.free_job()
>>> - * will not be called for all jobs still in
>>> drm_gpu_scheduler.pending_list.
>>> - * There is no solution for this currently. Thus, it is up to the
>>> driver to make
>>> - * sure that:
>>> - *
>>> - *  a) drm_sched_fini() is only called after for all submitted
>>> jobs
>>> - *     drm_sched_backend_ops.free_job() has been called or that
>>> - *  b) the jobs for which drm_sched_backend_ops.free_job() has not
>>> been called
>>> - *     after drm_sched_fini() ran are freed manually.
>>> - *
>>> - * FIXME: Take care of the above problem and prevent this function
>>> from leaking
>>> - * the jobs in drm_gpu_scheduler.pending_list under any
>>> circumstances.
>>> + * Tears down and cleans up the scheduler. Might leak memory if
>>> + * &struct drm_sched_backend_ops.cancel_pending_fences is not
>>> implemented.
>>>     */
>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	struct drm_sched_entity *s_entity;
>>>    	int i;
>>>    
>>> -	drm_sched_wqueue_stop(sched);
>>> +	if (sched->ops->cancel_pending_fences) {
>>> +		drm_sched_submission_and_timeout_stop(sched);
>>> +		drm_sched_cancel_jobs_and_wait(sched);
>>> +		drm_sched_free_stop(sched);
>>> +	} else {
>>> +		/* We're in "legacy free-mode" and ignore
>>> potential mem leaks */
>>> +		drm_sched_wqueue_stop(sched);
>>> +	}
>>>    
>>>    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
>>> i++) {
>>>    		struct drm_sched_rq *rq = sched->sched_rq[i];
>>> @@ -1464,7 +1529,7 @@ bool drm_sched_wqueue_ready(struct
>>> drm_gpu_scheduler *sched)
>>>    EXPORT_SYMBOL(drm_sched_wqueue_ready);
>>>    
>>>    /**
>>> - * drm_sched_wqueue_stop - stop scheduler submission
>>> + * drm_sched_wqueue_stop - stop scheduler submission and freeing
>>>     * @sched: scheduler instance
>>>     *
>>>     * Stops the scheduler from pulling new jobs from entities. It
>>> also stops
>>> @@ -1473,13 +1538,14 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
>>>    void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	WRITE_ONCE(sched->pause_submit, true);
>>> +	WRITE_ONCE(sched->pause_free, true);
>>>    	cancel_work_sync(&sched->work_run_job);
>>>    	cancel_work_sync(&sched->work_free_job);
>>>    }
>>>    EXPORT_SYMBOL(drm_sched_wqueue_stop);
>>>    
>>>    /**
>>> - * drm_sched_wqueue_start - start scheduler submission
>>> + * drm_sched_wqueue_start - start scheduler submission and freeing
>>>     * @sched: scheduler instance
>>>     *
>>>     * Restarts the scheduler after drm_sched_wqueue_stop() has
>>> stopped it.
>>> @@ -1490,6 +1556,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
>>>    void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	WRITE_ONCE(sched->pause_submit, false);
>>> +	WRITE_ONCE(sched->pause_free, false);
>>>    	queue_work(sched->submit_wq, &sched->work_run_job);
>>>    	queue_work(sched->submit_wq, &sched->work_free_job);
>>>    }
>>> diff --git a/include/drm/gpu_scheduler.h
>>> b/include/drm/gpu_scheduler.h
>>> index d860db087ea5..d8bd5b605336 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -29,6 +29,7 @@
>>>    #include <linux/completion.h>
>>>    #include <linux/xarray.h>
>>>    #include <linux/workqueue.h>
>>> +#include <linux/wait.h>
>>>    
>>>    #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
>>>    
>>> @@ -508,6 +509,19 @@ struct drm_sched_backend_ops {
>>>             * and it's time to clean it up.
>>>    	 */
>>>    	void (*free_job)(struct drm_sched_job *sched_job);
>>> +
>>> +	/**
>>> +	 * @cancel_pending_fences: cancel all unsignaled hardware
>>> fences
>>> +	 *
>>> +	 * This callback must signal all unsignaled hardware
>>> fences associated
>>> +	 * with @sched with an appropriate error code (e.g., -
>>> ECANCELED). This
>>> +	 * ensures that all jobs will get freed by the scheduler
>>> before
>>> +	 * teardown.
>>> +	 *
>>> +	 * This callback is optional, but it is highly recommended
>>> to implement
>>> +	 * it to avoid memory leaks.
>>> +	 */
>>> +	void (*cancel_pending_fences)(struct drm_gpu_scheduler
>>> *sched);
>>
>> I still don't understand why insist to use a new term in the backend
>> ops, and even the whole scheduler API. Nothing in the API so far has
>> fences in the name. Something like cancel(_all|pending)_jobs or
>> sched_fini would read more aligned with the rest to me.
> 
> Nothing has fences in the name, but they are the central concept of the
> API: run_job() returns them, and they are the main mechanism through
> which scheduler and driver communicate.
> 
> As mentioned in the other mail, the idea behind the callback is to get
> all hardware fences signaled. Just that. That's a simple, well
> established concept, easy to understand by drivers. In contrast, it
> would be far less clear what "cancel" even means. That's evident from
> the other patch where we're suddenly wondering whether the driver
> should also cancel timers etc. in the callback.

The mock scheduler uses hrtimers to simulate a GPU, so the fact that it 
uses hrtimer_cancel() to cancel a job is irrelevant for the discussion about 
the name "cancel" in the DRM scheduler API. That is just an implementation 
detail. It could have been using a thread or a workqueue, or whatever.

Besides, I don't see how you can argue against the name cancel if I 
propose cancel_(all|pending_)jobs() and you are proposing 
cancel_pending_fences().

> It should not. The contract is simple: "signal everything".
> 
> It's also more difficult to abuse, since signaling is always valid. But
> when is canceling valid?
> 
> Considering how internal APIs have been abused in the past, I can very
> well see some party using a cancel_job() callback in the future to
> "cancel" single jobs, for example in our (still unsolved)
> resubmit_jobs() problem.

I don't know, but this feels a bit of a stretch. It would not be 
scheduler-internal API, but a vfunc the driver provides to the scheduler. If 
the driver then decides to call it outside the scheduler paths, it can do 
that today with other vfuncs too. Or worse.

And it would be well defined in the API contract as something the scheduler 
calls, in order, on scheduler shutdown.
  >>>    };
>>>    
>>>    /**
>>> @@ -533,6 +547,8 @@ struct drm_sched_backend_ops {
>>>     *            timeout interval is over.
>>>     * @pending_list: the list of jobs which are currently in the job
>>> queue.
>>>     * @job_list_lock: lock to protect the pending_list.
>>> + * @pending_list_waitque: a waitqueue for drm_sched_fini() to
>>> block on until all
>>> + *		          pending jobs have been finished.
>>>     * @hang_limit: once the hangs by a job crosses this limit then
>>> it is marked
>>>     *              guilty and it will no longer be considered for
>>> scheduling.
>>>     * @score: score to help loadbalancer pick a idle sched
>>> @@ -540,6 +556,7 @@ struct drm_sched_backend_ops {
>>>     * @ready: marks if the underlying HW is ready to work
>>>     * @free_guilty: A hit to time out handler to free the guilty
>>> job.
>>>     * @pause_submit: pause queuing of @work_run_job on @submit_wq
>>> + * @pause_free: pause queueing of @work_free_job on @submit_wq
>>>     * @own_submit_wq: scheduler owns allocation of @submit_wq
>>>     * @dev: system &struct device
>>>     *
>>> @@ -562,12 +579,14 @@ struct drm_gpu_scheduler {
>>>    	struct delayed_work		work_tdr;
>>>    	struct list_head		pending_list;
>>>    	spinlock_t			job_list_lock;
>>> +	wait_queue_head_t		pending_list_waitque;
>>>    	int				hang_limit;
>>>    	atomic_t                        *score;
>>>    	atomic_t                        _score;
>>>    	bool				ready;
>>>    	bool				free_guilty;
>>>    	bool				pause_submit;
>>> +	bool				pause_free;
>>>    	bool				own_submit_wq;
>>>    	struct device			*dev;
>>>    };
>>
>> And, as you know, another thing I don't understand is why would we
>> choose to add more of the state machine when I have shown how it can
>> be
>> done more elegantly. You don't have to reply, this is more a for the
>> record against v3.
> 
> I do like your approach to a degree, and I reimplemented and tested it
> during the last days! Don't think I just easily toss aside a good idea;
> in fact, weighing between both approaches did cause me some headache :)
> 
> The thing is that while implementing it for the unit tests (not even to
> begin with Nouveau, where the implementation is a bit more difficult
> because some helpers need to be moved), I ran into a ton of faults
> because of how the tests are constructed. When do I have to cancel
> which timer for which job, all before calling drm_sched_fini(), or each
> one separately in cancel_job()? What about timedout jobs?
> 
> This [1], for example, is an implementation attempt I made which differs
> only slightly from yours but does not work and causes all sorts of
> issues with timer interrupts.
> 
> Now, that could obviously be fixed, and maybe I fail – but the thing
> is, if I can fail, others can, too. And porting the unit tests is a
> good beta-test.

Well, you did not come up with the whole series in a day either; it took 
quite some time. So the fact that converting the unit tests also took some 
time is not a red flag for any approach.

> When you look at the waitqueue solution, it is easy to implement for
> both Nouveau and sched/tests, with minimal adjustments. My approach
> works for both and is already well tested in Nouveau.
> 
> The ease with which it can be used and the simplicity of the contract
> ("signal all fences!") gives me confidence that this is, despite the
> larger state machine, more maintainable and that it's easier to port
> other drivers to the new cleanup method. I've had some success with
> similar approaches when cleaning up 16 year old broken PCI code [2].
> 
> Moreover, our long term goal should anyways be to, ideally, port all
> drivers to the new cleanup design and then make the callback mandatory.
> Then some places in drm_sched_fini() can be cleaned up, too.
> 
> Besides, the above reasons, like the resistance against abuse of a
> cancel_job(), also apply.
> 
> 
> Does that sound convincing? :)

Not yet I'm afraid.

If I look at the nouveau implementation, I eventually see this:

nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error)
{
	struct nouveau_fence *fence, *tmp;
	unsigned long flags;

	spin_lock_irqsave(&fctx->lock, flags);
	list_for_each_entry_safe(fence, tmp, &fctx->pending, head) {
		if (error && !dma_fence_is_signaled_locked(&fence->base))
			dma_fence_set_error(&fence->base, error);

		if (nouveau_fence_signal(fence))
			nvif_event_block(&fctx->event);
	}
	fctx->killed = 1;
	spin_unlock_irqrestore(&fctx->lock, flags);
}


While my proposal was to have this in the scheduler core:

  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
@@ -1397,6 +1395,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
         /* Confirm no work left behind accessing device structures */
         cancel_delayed_work_sync(&sched->work_tdr);

+       if (sched->ops->cancel_job) {
+               struct drm_sched_job *job;
+
+               list_for_each_entry_reverse(job, &sched->pending_list, list) {
+                       sched->ops->cancel_job(job);
+                       sched->ops->free_job(job);
+               }
+       }

Really similar loops. Would it be a good match to make 
nouveau_cancel_job() simply something like:

	spin_lock_irqsave(&fctx->lock, flags);
	if (!dma_fence_is_signaled_locked(&fence->base))
		dma_fence_set_error(&fence->base, ECANCELED);

	if (nouveau_fence_signal(fence))
		nvif_event_block(&fctx->event);

	spin_unlock_irqrestore(&fctx->lock, flags);

Assuming you can get to fctx from the fence. That doesn't work?

Or are those chan->killed and fctx->killed critical, so that they have to 
happen before/after canceling the jobs?

Regards,

Tvrtko

> P.
> 
> 
> [1] https://paste.debian.net/1376021/
> [2] https://lore.kernel.org/linux-pci/20250520085938.GB261485@rocinante/T/#m6e28eabdb22286238545c1fa6026445a4001d8e2
> 
> 
> 
>>
>> Regards,
>>
>> Tvrtko
>>
> 



* Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
  2025-05-22 14:59     ` Philipp Stanner
@ 2025-05-23 15:49       ` Tvrtko Ursulin
  0 siblings, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2025-05-23 15:49 UTC (permalink / raw)
  To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel


On 22/05/2025 15:59, Philipp Stanner wrote:
> On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote:
>>
>> On 22/05/2025 09:27, Philipp Stanner wrote:
>>> The drm_gpu_scheduler now supports a callback to help
>>> drm_sched_fini()
>>> avoid memory leaks. This callback instructs the driver to signal
>>> all
>>> pending hardware fences.
>>>
>>> Implement the new callback
>>> drm_sched_backend_ops.cancel_pending_fences().
>>>
>>> Have the callback use drm_mock_sched_job_complete() with a new
>>> error
>>> field for the fence error.
>>>
>>> Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there
>>> is
>>> no party for which checking for a CANCELED status would be useful
>>> currently.
>>>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>>    .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++-------
>>> -----
>>>    drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
>>>    2 files changed, 25 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> index f999c8859cf7..eca47f0395bc 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct
>>> drm_mock_sched_entity *entity)
>>>    	drm_sched_entity_destroy(&entity->base);
>>>    }
>>>    
>>> -static void drm_mock_sched_job_complete(struct drm_mock_sched_job
>>> *job)
>>> +static void drm_mock_sched_job_complete(struct drm_mock_sched_job
>>> *job, int err)
>>>    {
>>>    	struct drm_mock_scheduler *sched =
>>>    		drm_sched_to_mock_sched(job->base.sched);
>>> @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct
>>> drm_mock_sched_job *job)
>>>    	lockdep_assert_held(&sched->lock);
>>>    
>>>    	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
>>> -	list_move_tail(&job->link, &sched->done_list);
>>> +	list_del(&job->link);
>>> +	if (err)
>>> +		dma_fence_set_error(&job->hw_fence, err);
>>>    	dma_fence_signal(&job->hw_fence);
>>>    	complete(&job->done);
>>>    }
>>> @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer
>>> *hrtimer)
>>>    			break;
>>>    
>>>    		sched->hw_timeline.cur_seqno = job-
>>>> hw_fence.seqno;
>>> -		drm_mock_sched_job_complete(job);
>>> +		drm_mock_sched_job_complete(job, 0);
>>>    	}
>>>    	spin_unlock_irqrestore(&sched->lock, flags);
>>>    
>>> @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job
>>> *sched_job)
>>>    
>>>    static void mock_sched_free_job(struct drm_sched_job *sched_job)
>>>    {
>>> -	struct drm_mock_scheduler *sched =
>>> -			drm_sched_to_mock_sched(sched_job->sched);
>>>    	struct drm_mock_sched_job *job =
>>> drm_sched_job_to_mock_job(sched_job);
>>> -	unsigned long flags;
>>>    
>>> -	/* Remove from the scheduler done list. */
>>> -	spin_lock_irqsave(&sched->lock, flags);
>>> -	list_del(&job->link);
>>> -	spin_unlock_irqrestore(&sched->lock, flags);
>>>    	dma_fence_put(&job->hw_fence);
>>> -
>>>    	drm_sched_job_cleanup(sched_job);
>>>    
>>>    	/* Mock job itself is freed by the kunit framework. */
>>>    }
>>>    
>>> +static void mock_sched_cancel_pending_fences(struct
>>> drm_gpu_scheduler *gsched)
>>
>> "gsched" feels like a first time invention. Maybe drm_sched?
> 
> Alright
> 
>>
>>> +{
>>> +	struct drm_mock_sched_job *job, *next;
>>> +	struct drm_mock_scheduler *sched;
>>> +	unsigned long flags;
>>> +
>>> +	sched = container_of(gsched, struct drm_mock_scheduler,
>>> base);
>>> +
>>> +	spin_lock_irqsave(&sched->lock, flags);
>>> +	list_for_each_entry_safe(job, next, &sched->job_list,
>>> link)
>>> +		drm_mock_sched_job_complete(job, -ECANCELED);
>>> +	spin_unlock_irqrestore(&sched->lock, flags);
>>
>> Canceling of the timers belongs in this callback, I think. Otherwise
>> jobs are not fully cancelled.
> 
> I wouldn't say so – the timers represent things like the hardware
> interrupts. And those must be deactivated by the driver itself.
> 
> See, one big reason why I like my approach is that the contract between
> driver and scheduler is made very simple:
> 
> "Driver, signal all fences that you ever returned through run_job() to
> this scheduler!"
> 
> That always works, and the driver always has all those fences. It's
> based on the most fundamental agreement regarding dma_fences: they must
> all be signaled.
> 
>>
>> Hm, I also think, conceptually, the order of first canceling the
>> timer
>> and then signaling the fence should be kept.
> 
> That's the case here, no?

Sorry I got confused, I was context switching over many different topics 
this week.

Okay yes, the flow with your patch is correct on the high level and is 
like this:

  list_for_each_entry(job, &sched->job_list, link)
  	hrtimer_cancel(&job->timer);

  drm_sched_fini(&sched->base);
	drm_sched_wqueue_stop()
  	-> mock_sched_cancel_pending_fences()
  		-> drm_mock_sched_job_complete()..

But one problem is that you have no locking around the list walk, and the 
scheduler submit wq has not been stopped yet. So AFAICT there is a risk of 
list corruption, since mock_sched_run_job() can be called in parallel to 
the unit test exiting. It can also arm a new timer, which will then not be 
canceled and will cause a UAF when it fires.
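
Something along these lines would keep the two-pass approach and avoid the
race. An illustrative, untested sketch only; it relies on drm_sched_fini()
having already stopped the submit work before it invokes the callback (as
in patch 1), so no new jobs or timers can show up:

static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *drm_sched)
{
	struct drm_mock_scheduler *sched =
		container_of(drm_sched, struct drm_mock_scheduler, base);
	struct drm_mock_sched_job *job, *next;
	unsigned long flags;
	LIST_HEAD(list);

	/* Pass 1: detach all pending jobs under the lock. */
	spin_lock_irqsave(&sched->lock, flags);
	list_for_each_entry_safe(job, next, &sched->job_list, link)
		list_move_tail(&job->link, &list);
	spin_unlock_irqrestore(&sched->lock, flags);

	/* Pass 2: cancel the timers outside the lock; the handler takes it. */
	list_for_each_entry(job, &list, link)
		hrtimer_cancel(&job->timer);

	/* Pass 3: signal the hardware fences with an error. */
	spin_lock_irqsave(&sched->lock, flags);
	list_for_each_entry_safe(job, next, &list, link)
		drm_mock_sched_job_complete(job, -ECANCELED);
	spin_unlock_irqrestore(&sched->lock, flags);
}

drm_mock_sched_fini() could then drop its own list walk and just call
drm_sched_fini().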

> It must indeed be kept, otherwise the timers could fire after
> everything is torn down -> UAF
> 
>>
>> At the moment it does not matter hugely, since the timer does not
>> signal
>> the jobs directly and will not find unlinked jobs, but if that
>> changes
>> in the future, the reversed order could cause double signaling. So if
>> you keep it in the correct logical order that potential gotcha is
>> avoided. Basically just keep the two pass approach verbatim, as is in
>> the current drm_mock_sched_fini.
>>
>> The rest of the conversion is I think good.
> 
> :)
> 
>>
>> Only a slight uncertainty after I cross-referenced with my version
>> (->cancel_job()) around why I needed to add signaling to
>> mock_sched_timedout_job() and manual job cleanup to the timeout test.
>> It
>> was more than a month ago that I wrote it so can't remember right
>> now.
>> You checked for memory leaks and the usual stuff?
> 
> Hm, it seems I indeed ran into that leak that you fixed (in addition to
> the other stuff) in your RFC, for the timeout tests.
> 
> We should fix that in a separate patch, probably.

Hm, I am pretty sure that was only needed after I added ->cancel_job(), so 
I am not sure there is something to fix standalone. I would have to 
dive back in, but that will have to wait a week or so.

Regards,

Tvrtko

>>> +}
>>> +
>>>    static const struct drm_sched_backend_ops drm_mock_scheduler_ops
>>> = {
>>>    	.run_job = mock_sched_run_job,
>>>    	.timedout_job = mock_sched_timedout_job,
>>> -	.free_job = mock_sched_free_job
>>> +	.free_job = mock_sched_free_job,
>>> +	.cancel_pending_fences = mock_sched_cancel_pending_fences,
>>>    };
>>>    
>>>    /**
>>> @@ -265,7 +274,6 @@ struct drm_mock_scheduler
>>> *drm_mock_sched_new(struct kunit *test, long timeout)
>>>    	sched->hw_timeline.context = dma_fence_context_alloc(1);
>>>    	atomic_set(&sched->hw_timeline.next_seqno, 0);
>>>    	INIT_LIST_HEAD(&sched->job_list);
>>> -	INIT_LIST_HEAD(&sched->done_list);
>>>    	spin_lock_init(&sched->lock);
>>>    
>>>    	return sched;
>>> @@ -280,38 +288,11 @@ struct drm_mock_scheduler
>>> *drm_mock_sched_new(struct kunit *test, long timeout)
>>>     */
>>>    void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>>>    {
>>> -	struct drm_mock_sched_job *job, *next;
>>> -	unsigned long flags;
>>> -	LIST_HEAD(list);
>>> +	struct drm_mock_sched_job *job;
>>>    
>>> -	drm_sched_wqueue_stop(&sched->base);
>>> -
>>> -	/* Force complete all unfinished jobs. */
>>> -	spin_lock_irqsave(&sched->lock, flags);
>>> -	list_for_each_entry_safe(job, next, &sched->job_list, link)
>>> -		list_move_tail(&job->link, &list);
>>> -	spin_unlock_irqrestore(&sched->lock, flags);
>>> -
>>> -	list_for_each_entry(job, &list, link)
>>> +	list_for_each_entry(job, &sched->job_list, link)
>>>    		hrtimer_cancel(&job->timer);
>>>    
>>> -	spin_lock_irqsave(&sched->lock, flags);
>>> -	list_for_each_entry_safe(job, next, &list, link)
>>> -		drm_mock_sched_job_complete(job);
>>> -	spin_unlock_irqrestore(&sched->lock, flags);
>>> -
>>> -	/*
>>> -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
>>> -	 * free worker.
>>> -	 */
>>> -	spin_lock_irqsave(&sched->lock, flags);
>>> -	list_for_each_entry_safe(job, next, &sched->done_list, link)
>>> -		list_move_tail(&job->link, &list);
>>> -	spin_unlock_irqrestore(&sched->lock, flags);
>>> -
>>> -	list_for_each_entry_safe(job, next, &list, link)
>>> -		mock_sched_free_job(&job->base);
>>> -
>>>    	drm_sched_fini(&sched->base);
>>>    }
>>>    
>>> @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
>>>    		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
>>>    			break;
>>>    
>>> -		drm_mock_sched_job_complete(job);
>>> +		drm_mock_sched_job_complete(job, 0);
>>>    		found++;
>>>    	}
>>>    unlock:
>>> diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>> index 27caf8285fb7..22e530d87791 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>> @@ -32,9 +32,8 @@
>>>     *
>>>     * @base: DRM scheduler base class
>>>     * @test: Backpointer to owning the kunit test case
>>> - * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
>>> + * @lock: Lock to protect the simulated @hw_timeline and @job_list
>>>     * @job_list: List of jobs submitted to the mock GPU
>>> - * @done_list: List of jobs completed by the mock GPU
>>>     * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
>>>     *		 @cur_seqno for implementing a struct dma_fence signaling the
>>>     *		 simulated job completion.
>>> @@ -49,7 +48,6 @@ struct drm_mock_scheduler {
>>>    
>>>    	spinlock_t		lock;
>>>    	struct list_head	job_list;
>>> -	struct list_head	done_list;
>>>    
>>>    	struct {
>>>    		u64		context;
>>
> 



Thread overview: 13+ messages
2025-05-22  8:27 [PATCH v3 0/5] Fix memory leaks in drm_sched_fini() Philipp Stanner
2025-05-22  8:27 ` [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
2025-05-22 12:44   ` Danilo Krummrich
2025-05-22 13:37   ` Tvrtko Ursulin
2025-05-22 15:32     ` Philipp Stanner
2025-05-23 15:35       ` Tvrtko Ursulin
2025-05-22  8:27 ` [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method Philipp Stanner
2025-05-22 14:06   ` Tvrtko Ursulin
2025-05-22 14:59     ` Philipp Stanner
2025-05-23 15:49       ` Tvrtko Ursulin
2025-05-22  8:27 ` [PATCH v3 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-05-22  8:27 ` [PATCH v3 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-05-22  8:27 ` [PATCH v3 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
