* [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini()
@ 2025-04-07 15:22 Philipp Stanner
  2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

Changes since the RFC:
  - (None)

Howdy,

as many of you know, we have potential memory leaks in drm_sched_fini()
which various parties have tried to solve with various methods in the
past.

In our past discussions we came to the conclusion that the simplest
solution, blocking in drm_sched_fini(), is not possible because it could
cause processes to ignore SIGKILL and block for too long (which could
turn out to be an effective way to generate a funny email from Linus,
though :) )

Another idea was to have submitted jobs refcount the scheduler. I
investigated this, and we found that it would *additionally* require
*the scheduler* to refcount everything *in the driver* that is accessed
through the still-running callbacks, since the driver might want to
unload after a non-blocking drm_sched_fini() call. So that's not a
solution either.

This RFC here is a new approach, somewhat based on the original
waitque-idea. It looks as follows:

1. Have drm_sched_fini() block until the pending_list becomes empty with
   a waitque, as a first step.
2. Provide the scheduler with a callback with which it can instruct the
   driver to kill the associated fence context. This will cause all
   pending hardware fences to get signalled. (Credit to Danilo, whose
   idea this was)
3. In drm_sched_fini(), first switch off submission of new jobs and
   timeouts (the latter might not be strictly necessary, but is probably
   cleaner).
4. Then, call the aforementioned callback, ensuring that free_job() will
   be called for all remaining jobs relatively quickly. This has the
   great advantage that the jobs get cleaned up through the standard
   mechanism.
5. Once all jobs are gone, also switch off the free_job() work item and
   then proceed as usual.

Furthermore, since there is now such a callback, we can provide an
if-branch checking for its existence. If the driver doesn't provide it,
drm_sched_fini() operates in "legacy mode". So none of the existing
drivers should notice a difference and we remain fully backwards
compatible.
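
To make this concrete, here is a condensed sketch of what drm_sched_fini()
ends up doing with patches 1 and 2 applied (entity and run queue cleanup
omitted; see the patches below for the real thing):

	if (sched->ops->kill_fence_context) {
		/*
		 * Stop pulling new jobs and stop the timeout handler, but
		 * keep the free_job() work running.
		 */
		drm_sched_submission_and_timeout_stop(sched);

		/*
		 * Let the driver kill its fence context: all pending hardware
		 * fences get signalled, free_job() drains the pending_list,
		 * and the waitqueue wakes us up once the list is empty.
		 */
		drm_sched_cancel_jobs_and_wait(sched);

		/* Only now switch off the free_job() work item as well. */
		drm_sched_free_stop(sched);
	} else {
		/* No callback provided: legacy mode; pending jobs may leak. */
		drm_sched_wqueue_stop(sched);
	}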

Our glorious beta-tester is Nouveau, which so far had its own waitque
solution, which is now obsolete. The last two patches port Nouveau and
remove that waitque.

I've tested this on a desktop environment with Nouveau. Works fine and
solves the problem (though we did discover an unrelated problem inside
Nouveau in the process).

Tvrtko's unit tests also run as expected (except for the new warning
print in patch 3), which is not surprising since they don't provide the
callback.

I'm looking forward to your input and feedback. I really hope we can
work this RFC into something that can provide users with a more
reliable, clean scheduler API.

Philipp

Philipp Stanner (5):
  drm/sched: Fix teardown leaks with waitqueue
  drm/sched: Prevent teardown waitque from blocking too long
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown

 drivers/gpu/drm/nouveau/nouveau_abi16.c |   4 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c   |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c |  39 +++++----
 drivers/gpu/drm/nouveau/nouveau_sched.h |  12 +--
 drivers/gpu/drm/nouveau/nouveau_uvmm.c  |   8 +-
 drivers/gpu/drm/scheduler/sched_main.c  | 111 +++++++++++++++++++-----
 include/drm/gpu_scheduler.h             |  19 ++++
 7 files changed, 146 insertions(+), 49 deletions(-)

-- 
2.48.1



* [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
@ 2025-04-07 15:22 ` Philipp Stanner
  2025-04-17  7:49   ` Philipp Stanner
  2025-04-07 15:22 ` [PATCH 2/5] drm/sched: Prevent teardown waitque from blocking too long Philipp Stanner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel, Philipp Stanner

From: Philipp Stanner <pstanner@redhat.com>

The GPU scheduler currently does not ensure that its pending_list is
empty before performing various other teardown tasks in
drm_sched_fini().

If there are still jobs in the pending_list, this is problematic because
after scheduler teardown, no one will call backend_ops.free_job()
anymore. This would, consequently, result in memory leaks.

One way to solve this is to implement a waitqueue that drm_sched_fini()
blocks on until the pending_list has become empty.

Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
pending_list becomes empty. Wait in drm_sched_fini() for that to happen.

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 84 ++++++++++++++++++++------
 include/drm/gpu_scheduler.h            |  8 +++
 2 files changed, 75 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 829579c41c6b..a6a4deeda72b 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -367,7 +367,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
  */
 static void __drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
 {
-	if (!READ_ONCE(sched->pause_submit))
+	if (!READ_ONCE(sched->pause_free))
 		queue_work(sched->submit_wq, &sched->work_free_job);
 }
 
@@ -556,6 +556,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		 * is parked at which point it's safe.
 		 */
 		list_del_init(&job->list);
+
 		spin_unlock(&sched->job_list_lock);
 
 		status = job->sched->ops->timedout_job(job);
@@ -572,8 +573,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		spin_unlock(&sched->job_list_lock);
 	}
 
-	if (status != DRM_GPU_SCHED_STAT_ENODEV)
-		drm_sched_start_timeout_unlocked(sched);
+	if (status != DRM_GPU_SCHED_STAT_ENODEV) {
+		if (!READ_ONCE(sched->cancel_timeout))
+			drm_sched_start_timeout_unlocked(sched);
+	}
 }
 
 /**
@@ -1119,12 +1122,22 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
+		/*
+		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
+		 */
+		if (list_empty(&sched->pending_list))
+			wake_up(&sched->pending_list_waitque);
+
 		/* cancel this job's TO timer */
 		cancel_delayed_work(&sched->work_tdr);
 		/* make the scheduled timestamp more accurate */
 		next = list_first_entry_or_null(&sched->pending_list,
 						typeof(*next), list);
 
+		// TODO RFC: above we wake up the waitque which relies on the fact
+		// that timeouts have been deactivated. The below should never
+		// reactivate them since the list was empty above. Still, should
+		// we document that?
 		if (next) {
 			if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
 				     &next->s_fence->scheduled.flags))
@@ -1324,6 +1337,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
+	init_waitqueue_head(&sched->pending_list_waitque);
 	atomic_set(&sched->credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
@@ -1331,6 +1345,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
 	sched->pause_submit = false;
+	sched->pause_free = false;
+	sched->cancel_timeout = false;
 
 	sched->ready = true;
 	return 0;
@@ -1348,6 +1364,40 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 }
 EXPORT_SYMBOL(drm_sched_init);
 
+/**
+ * drm_sched_submission_and_timeout_stop - stop everything except for free_job()
+ * @sched: scheduler instance
+ *
+ * Only needed to cleanly tear down the scheduler in drm_sched_fini().
+ */
+static inline void
+drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler *sched)
+{
+	WRITE_ONCE(sched->pause_submit, true);
+	WRITE_ONCE(sched->cancel_timeout, true);
+	cancel_work_sync(&sched->work_run_job);
+	cancel_delayed_work_sync(&sched->work_tdr);
+}
+
+static inline void
+drm_sched_free_stop(struct drm_gpu_scheduler *sched)
+{
+	WRITE_ONCE(sched->pause_free, true);
+	cancel_work_sync(&sched->work_free_job);
+}
+
+static inline bool
+drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
+{
+	bool empty;
+
+	spin_lock(&sched->job_list_lock);
+	empty = list_empty(&sched->pending_list);
+	spin_unlock(&sched->job_list_lock);
+
+	return empty;
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
@@ -1355,26 +1405,24 @@ EXPORT_SYMBOL(drm_sched_init);
  *
  * Tears down and cleans up the scheduler.
  *
- * This stops submission of new jobs to the hardware through
- * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
- * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
- * There is no solution for this currently. Thus, it is up to the driver to make
- * sure that:
- *
- *  a) drm_sched_fini() is only called after for all submitted jobs
- *     drm_sched_backend_ops.free_job() has been called or that
- *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
- *     after drm_sched_fini() ran are freed manually.
- *
- * FIXME: Take care of the above problem and prevent this function from leaking
- * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
+ * Note that this function blocks until all the fences returned by
+ * &struct drm_sched_backend_ops.run_job have been signalled.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	drm_sched_wqueue_stop(sched);
+	/*
+	 * Jobs that have neither been scheduled or which have timed out are
+	 * gone by now, but jobs that have been submitted through
+	 * backend_ops.run_job() and have not yet terminated are still pending.
+	 *
+	 * Wait for the pending_list to become empty to avoid leaking those jobs.
+	 */
+	drm_sched_submission_and_timeout_stop(sched);
+	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
+	drm_sched_free_stop(sched);
 
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
@@ -1471,6 +1519,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
 void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
 {
 	WRITE_ONCE(sched->pause_submit, true);
+	WRITE_ONCE(sched->pause_free, true);
 	cancel_work_sync(&sched->work_run_job);
 	cancel_work_sync(&sched->work_free_job);
 }
@@ -1488,6 +1537,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
 {
 	WRITE_ONCE(sched->pause_submit, false);
+	WRITE_ONCE(sched->pause_free, false);
 	queue_work(sched->submit_wq, &sched->work_run_job);
 	queue_work(sched->submit_wq, &sched->work_free_job);
 }
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 1a7e377d4cbb..badcf9ea4501 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/xarray.h>
 #include <linux/workqueue.h>
+#include <linux/wait.h>
 
 #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
 
@@ -533,6 +534,8 @@ struct drm_sched_backend_ops {
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
+ * @pending_list_waitque: a waitqueue for drm_sched_fini() to block on until all
+ *		          pending jobs have been finished.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
  *              guilty and it will no longer be considered for scheduling.
  * @score: score to help loadbalancer pick a idle sched
@@ -540,6 +543,8 @@ struct drm_sched_backend_ops {
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
  * @pause_submit: pause queuing of @work_run_job on @submit_wq
+ * @pause_free: pause queuing of @work_free_job on @submit_wq
+ * @cancel_timeout: pause queuing of @work_tdr on @submit_wq
  * @own_submit_wq: scheduler owns allocation of @submit_wq
  * @dev: system &struct device
  *
@@ -562,12 +567,15 @@ struct drm_gpu_scheduler {
 	struct delayed_work		work_tdr;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
+	wait_queue_head_t		pending_list_waitque;
 	int				hang_limit;
 	atomic_t                        *score;
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
 	bool				pause_submit;
+	bool				pause_free;
+	bool				cancel_timeout;
 	bool				own_submit_wq;
 	struct device			*dev;
 };
-- 
2.48.1



* [PATCH 2/5] drm/sched: Prevent teardown waitque from blocking too long
  2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
  2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
@ 2025-04-07 15:22 ` Philipp Stanner
  2025-04-07 15:22 ` [PATCH 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel, Danilo Krummrich

The waitqueue that ensures that drm_sched_fini() blocks until the
pending_list has become empty could theoretically cause that function to
block for a very long time. That, ultimately, could block userspace
processes and prevent them from being killable through SIGKILL.

When a driver calls drm_sched_fini(), it is safe to assume that all
still pending jobs are not needed anymore anyway. Thus, they can be
cancelled, thereby ensuring that drm_sched_fini() will return
relatively quickly.

Implement a new helper to stop all work items / submission except for
drm_sched_backend_ops.free_job().

Implement a driver callback, kill_fence_context(), that instructs the
driver to kill the fence context associated with this scheduler, thereby
causing all pending hardware fences to be signalled.

Call those new routines in drm_sched_fini() and ensure backwards
compatibility if the new callback is not implemented.

Suggested-by: Danilo Krummrich <dakr@redhat.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++---------
 include/drm/gpu_scheduler.h            | 11 ++++++
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a6a4deeda72b..6b72278c4b72 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1398,31 +1398,46 @@ drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
 	return empty;
 }
 
+/**
+ * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
+ * @sched: scheduler instance
+ *
+ * Must only be called if &struct drm_sched_backend_ops.kill_fence_context is
+ * implemented.
+ *
+ * Instructs the driver to kill the fence context associated with this scheduler,
+ * thereby signalling all pending fences. This, in turn, will trigger
+ * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
+ * The function then blocks until all pending jobs have been freed.
+ */
+static inline void
+drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
+{
+	sched->ops->kill_fence_context(sched);
+	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
  * @sched: scheduler instance
  *
- * Tears down and cleans up the scheduler.
- *
- * Note that this function blocks until all the fences returned by
- * &struct drm_sched_backend_ops.run_job have been signalled.
+ * Tears down and cleans up the scheduler. Might leak memory if
+ * &struct drm_sched_backend_ops.kill_fence_context is not implemented.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	/*
-	 * Jobs that have neither been scheduled or which have timed out are
-	 * gone by now, but jobs that have been submitted through
-	 * backend_ops.run_job() and have not yet terminated are still pending.
-	 *
-	 * Wait for the pending_list to become empty to avoid leaking those jobs.
-	 */
-	drm_sched_submission_and_timeout_stop(sched);
-	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
-	drm_sched_free_stop(sched);
+	if (sched->ops->kill_fence_context) {
+		drm_sched_submission_and_timeout_stop(sched);
+		drm_sched_cancel_jobs_and_wait(sched);
+		drm_sched_free_stop(sched);
+	} else {
+		/* We're in "legacy free-mode" and ignore potential mem leaks */
+		drm_sched_wqueue_stop(sched);
+	}
 
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
@@ -1510,7 +1525,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_ready);
 
 /**
- * drm_sched_wqueue_stop - stop scheduler submission
+ * drm_sched_wqueue_stop - stop scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Stops the scheduler from pulling new jobs from entities. It also stops
@@ -1526,7 +1541,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_stop);
 
 /**
- * drm_sched_wqueue_start - start scheduler submission
+ * drm_sched_wqueue_start - start scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index badcf9ea4501..e13eb7e4c93a 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @kill_fence_context: kill the fence context belonging to this scheduler
+	 *
+	 * Needed to cleanly tear the scheduler down in drm_sched_fini(). This
+	 * callback will cause all hardware fences to be signalled by the driver,
+	 * which, ultimately, ensures that all jobs get freed before teardown.
+	 *
+	 * This callback is optional, but it is highly recommended to implement it.
+	 */
+	void (*kill_fence_context)(struct drm_gpu_scheduler *sched);
 };
 
 /**
-- 
2.48.1



* [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
  2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
  2025-04-07 15:22 ` [PATCH 2/5] drm/sched: Prevent teardown waitque from blocking too long Philipp Stanner
@ 2025-04-07 15:22 ` Philipp Stanner
  2025-04-17  7:45   ` Philipp Stanner
  2025-04-07 15:22 ` [PATCH 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
  2025-04-07 15:22 ` [PATCH 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
  4 siblings, 1 reply; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

drm_sched_fini() can leak jobs under certain circumstances.

Warn if that happens.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6b72278c4b72..ae3152beca14 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	sched->ready = false;
 	kfree(sched->sched_rq);
 	sched->sched_rq = NULL;
+
+	if (!list_empty(&sched->pending_list))
+		dev_err(sched->dev, "%s: Tearing down scheduler while jobs are pending!\n",
+			__func__);
 }
 EXPORT_SYMBOL(drm_sched_fini);
 
-- 
2.48.1



* [PATCH 4/5] drm/nouveau: Add new callback for scheduler teardown
  2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
                   ` (2 preceding siblings ...)
  2025-04-07 15:22 ` [PATCH 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
@ 2025-04-07 15:22 ` Philipp Stanner
  2025-04-07 15:22 ` [PATCH 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
  4 siblings, 0 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

There is a new callback for always tearing the scheduler down in a
leak-free, deadlock-free manner.

Port Nouveau as its first user by providing the scheduler with a
callback that ensures the fence context gets killed in drm_sched_fini().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_abi16.c |  4 ++--
 drivers/gpu/drm/nouveau/nouveau_drm.c   |  2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c | 19 +++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_sched.h |  3 +++
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_abi16.c b/drivers/gpu/drm/nouveau/nouveau_abi16.c
index 2a0617e5fe2a..53b8a85a8adb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_abi16.c
+++ b/drivers/gpu/drm/nouveau/nouveau_abi16.c
@@ -415,8 +415,8 @@ nouveau_abi16_ioctl_channel_alloc(ABI16_IOCTL_ARGS)
 	 * The client lock is already acquired by nouveau_abi16_get().
 	 */
 	if (nouveau_cli_uvmm(cli)) {
-		ret = nouveau_sched_create(&chan->sched, drm, drm->sched_wq,
-					   chan->chan->dma.ib_max);
+		ret = nouveau_sched_create(&chan->sched, drm, chan->chan,
+					   drm->sched_wq, chan->chan->dma.ib_max);
 		if (ret)
 			goto done;
 	}
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index c69139701056..2a2b319dca5f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -313,7 +313,7 @@ nouveau_cli_init(struct nouveau_drm *drm, const char *sname,
 	 * locks which indirectly or directly are held for allocations
 	 * elsewhere.
 	 */
-	ret = nouveau_sched_create(&cli->sched, drm, NULL, 1);
+	ret = nouveau_sched_create(&cli->sched, drm, NULL, NULL, 1);
 	if (ret)
 		goto done;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index d326e55d2d24..3659ac78bb3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -11,6 +11,7 @@
 #include "nouveau_exec.h"
 #include "nouveau_abi16.h"
 #include "nouveau_sched.h"
+#include "nouveau_chan.h"
 
 #define NOUVEAU_SCHED_JOB_TIMEOUT_MS		10000
 
@@ -392,10 +393,22 @@ nouveau_sched_free_job(struct drm_sched_job *sched_job)
 	nouveau_job_fini(job);
 }
 
+static void
+nouveau_sched_fence_context_kill(struct drm_gpu_scheduler *sched)
+{
+	struct nouveau_sched *nsched;
+
+	nsched = container_of(sched, struct nouveau_sched, base);
+
+	if (nsched->chan)
+		nouveau_channel_kill(nsched->chan);
+}
+
 static const struct drm_sched_backend_ops nouveau_sched_ops = {
 	.run_job = nouveau_sched_run_job,
 	.timedout_job = nouveau_sched_timedout_job,
 	.free_job = nouveau_sched_free_job,
+	.kill_fence_context = nouveau_sched_fence_context_kill,
 };
 
 static int
@@ -461,7 +474,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
 
 int
 nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
-		     struct workqueue_struct *wq, u32 credit_limit)
+		     struct nouveau_channel *chan, struct workqueue_struct *wq,
+		     u32 credit_limit)
 {
 	struct nouveau_sched *sched;
 	int ret;
@@ -470,6 +484,8 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
 	if (!sched)
 		return -ENOMEM;
 
+	sched->chan = chan;
+
 	ret = nouveau_sched_init(sched, drm, wq, credit_limit);
 	if (ret) {
 		kfree(sched);
@@ -481,7 +497,6 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
 	return 0;
 }
 
-
 static void
 nouveau_sched_fini(struct nouveau_sched *sched)
 {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index 20cd1da8db73..e6e2016a3569 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -9,6 +9,7 @@
 #include <drm/gpu_scheduler.h>
 
 #include "nouveau_drv.h"
+#include "nouveau_chan.h"
 
 #define to_nouveau_job(sched_job)		\
 		container_of((sched_job), struct nouveau_job, base)
@@ -101,6 +102,7 @@ struct nouveau_sched {
 	struct drm_sched_entity entity;
 	struct workqueue_struct *wq;
 	struct mutex mutex;
+	struct nouveau_channel *chan;
 
 	struct {
 		struct {
@@ -112,6 +114,7 @@ struct nouveau_sched {
 };
 
 int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
+			 struct nouveau_channel *chan,
 			 struct workqueue_struct *wq, u32 credit_limit);
 void nouveau_sched_destroy(struct nouveau_sched **psched);
 
-- 
2.48.1



* [PATCH 5/5] drm/nouveau: Remove waitque for sched teardown
  2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
                   ` (3 preceding siblings ...)
  2025-04-07 15:22 ` [PATCH 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
@ 2025-04-07 15:22 ` Philipp Stanner
  4 siblings, 0 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-07 15:22 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Philipp Stanner, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

struct nouveau_sched contains a waitque needed to prevent
drm_sched_fini() from being called while there are still jobs pending.
Until now, doing so would have caused memory leaks.

With the new memleak-free mode of operation, switched on in
drm_sched_fini() by providing the callback
nouveau_sched_fence_context_kill(), the waitque is not necessary
anymore.

Remove the waitque.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++++++-------------
 drivers/gpu/drm/nouveau/nouveau_sched.h |  9 +++------
 drivers/gpu/drm/nouveau/nouveau_uvmm.c  |  8 ++++----
 3 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 3659ac78bb3e..d9ac76198616 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -121,11 +121,9 @@ nouveau_job_done(struct nouveau_job *job)
 {
 	struct nouveau_sched *sched = job->sched;
 
-	spin_lock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
 	list_del(&job->entry);
-	spin_unlock(&sched->job.list.lock);
-
-	wake_up(&sched->job.wq);
+	spin_unlock(&sched->job_list.lock);
 }
 
 void
@@ -306,9 +304,9 @@ nouveau_job_submit(struct nouveau_job *job)
 	}
 
 	/* Submit was successful; add the job to the schedulers job list. */
-	spin_lock(&sched->job.list.lock);
-	list_add(&job->entry, &sched->job.list.head);
-	spin_unlock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
+	list_add(&job->entry, &sched->job_list.head);
+	spin_unlock(&sched->job_list.lock);
 
 	drm_sched_job_arm(&job->base);
 	job->done_fence = dma_fence_get(&job->base.s_fence->finished);
@@ -458,9 +456,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
 		goto fail_sched;
 
 	mutex_init(&sched->mutex);
-	spin_lock_init(&sched->job.list.lock);
-	INIT_LIST_HEAD(&sched->job.list.head);
-	init_waitqueue_head(&sched->job.wq);
+	spin_lock_init(&sched->job_list.lock);
+	INIT_LIST_HEAD(&sched->job_list.head);
 
 	return 0;
 
@@ -503,9 +500,6 @@ nouveau_sched_fini(struct nouveau_sched *sched)
 	struct drm_gpu_scheduler *drm_sched = &sched->base;
 	struct drm_sched_entity *entity = &sched->entity;
 
-	rmb(); /* for list_empty to work without lock */
-	wait_event(sched->job.wq, list_empty(&sched->job.list.head));
-
 	drm_sched_entity_fini(entity);
 	drm_sched_fini(drm_sched);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index e6e2016a3569..339a14563fbb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -105,12 +105,9 @@ struct nouveau_sched {
 	struct nouveau_channel *chan;
 
 	struct {
-		struct {
-			struct list_head head;
-			spinlock_t lock;
-		} list;
-		struct wait_queue_head wq;
-	} job;
+		struct list_head head;
+		spinlock_t lock;
+	} job_list;
 };
 
 int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index 48f105239f42..ddfc46bc1b3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1019,8 +1019,8 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 	u64 end = addr + range;
 
 again:
-	spin_lock(&sched->job.list.lock);
-	list_for_each_entry(__job, &sched->job.list.head, entry) {
+	spin_lock(&sched->job_list.lock);
+	list_for_each_entry(__job, &sched->job_list.head, entry) {
 		struct nouveau_uvmm_bind_job *bind_job = to_uvmm_bind_job(__job);
 
 		list_for_each_op(op, &bind_job->ops) {
@@ -1030,7 +1030,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 
 				if (!(end <= op_addr || addr >= op_end)) {
 					nouveau_uvmm_bind_job_get(bind_job);
-					spin_unlock(&sched->job.list.lock);
+					spin_unlock(&sched->job_list.lock);
 					wait_for_completion(&bind_job->complete);
 					nouveau_uvmm_bind_job_put(bind_job);
 					goto again;
@@ -1038,7 +1038,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 			}
 		}
 	}
-	spin_unlock(&sched->job.list.lock);
+	spin_unlock(&sched->job_list.lock);
 }
 
 static int
-- 
2.48.1



* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-07 15:22 ` [PATCH 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
@ 2025-04-17  7:45   ` Philipp Stanner
  2025-04-17 11:27     ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: Philipp Stanner @ 2025-04-17  7:45 UTC (permalink / raw)
  To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> drm_sched_fini() can leak jobs under certain circumstances.
> 
> Warn if that happens.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++++

I hear a lot of amazed silence for this series ^_^

If there's no major pushback, my intention is to pull this in once the
blocking Nouveau-bug has been fixed (by pulling my patch).


In the mean time, let me review my own stuff:

>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 6b72278c4b72..ae3152beca14 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> *sched)
>  	sched->ready = false;
>  	kfree(sched->sched_rq);
>  	sched->sched_rq = NULL;
> +
> +	if (!list_empty(&sched->pending_list))
> +		dev_err(sched->dev, "%s: Tearing down scheduler
> while jobs are pending!\n",
> +			__func__);

The "%s" here will be removed since dev_err() already prints the
function name.

I find this dev_err() print more useful than the stack a WARN_ON prints
(telling you about invalid_asm_exec_op or stuff like that).

Plus, I guess the places where drivers call drm_sched_fini() are very
well defined and known, so a callstack wouldn't really be useful in the
first place.


P.

>  }
>  EXPORT_SYMBOL(drm_sched_fini);
>  



* Re: [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue
  2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
@ 2025-04-17  7:49   ` Philipp Stanner
  0 siblings, 0 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-17  7:49 UTC (permalink / raw)
  To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Matthew Brost, Christian König,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Tvrtko Ursulin
  Cc: dri-devel, nouveau, linux-kernel

On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> From: Philipp Stanner <pstanner@redhat.com>
> 
> The GPU scheduler currently does not ensure that its pending_list is
> empty before performing various other teardown tasks in
> drm_sched_fini().
> 
> If there are still jobs in the pending_list, this is problematic
> because
> after scheduler teardown, no one will call backend_ops.free_job()
> anymore. This would, consequently, result in memory leaks.
> 
> One way to solve this is to implement a waitqueue that
> drm_sched_fini()
> blocks on until the pending_list has become empty.
> 
> Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
> pending_list becomes empty. Wait in drm_sched_fini() for that to
> happen.
> 
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 84 ++++++++++++++++++++----
> --
>  include/drm/gpu_scheduler.h            |  8 +++
>  2 files changed, 75 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 829579c41c6b..a6a4deeda72b 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -367,7 +367,7 @@ static void drm_sched_run_job_queue(struct
> drm_gpu_scheduler *sched)
>   */
>  static void __drm_sched_run_free_queue(struct drm_gpu_scheduler
> *sched)
>  {
> -	if (!READ_ONCE(sched->pause_submit))
> +	if (!READ_ONCE(sched->pause_free))
>  		queue_work(sched->submit_wq, &sched->work_free_job);
>  }
>  
> @@ -556,6 +556,7 @@ static void drm_sched_job_timedout(struct
> work_struct *work)
>  		 * is parked at which point it's safe.
>  		 */
>  		list_del_init(&job->list);
> +
>  		spin_unlock(&sched->job_list_lock);
>  
>  		status = job->sched->ops->timedout_job(job);
> @@ -572,8 +573,10 @@ static void drm_sched_job_timedout(struct
> work_struct *work)
>  		spin_unlock(&sched->job_list_lock);
>  	}
>  
> -	if (status != DRM_GPU_SCHED_STAT_ENODEV)
> -		drm_sched_start_timeout_unlocked(sched);
> +	if (status != DRM_GPU_SCHED_STAT_ENODEV) {
> +		if (!READ_ONCE(sched->cancel_timeout))

Another thing I want to investigate is whether all those booleans we
read with READ_ONCE are actually necessary?

Because I suspect they are not. The cancel_work_sync() functions should
do the trick, and if they're too late we'd just wait one cycle longer.

@Christian, Tvrtko, could be helpful if you can take a look.

> +			drm_sched_start_timeout_unlocked(sched);
> +	}
>  }
>  
>  /**
> @@ -1119,12 +1122,22 @@ drm_sched_get_finished_job(struct
> drm_gpu_scheduler *sched)
>  		/* remove job from pending_list */
>  		list_del_init(&job->list);
>  
> +		/*
> +		 * Inform tasks blocking in drm_sched_fini() that
> it's now safe to proceed.
> +		 */
> +		if (list_empty(&sched->pending_list))
> +			wake_up(&sched->pending_list_waitque);
> +
>  		/* cancel this job's TO timer */
>  		cancel_delayed_work(&sched->work_tdr);
>  		/* make the scheduled timestamp more accurate */
>  		next = list_first_entry_or_null(&sched-
> >pending_list,
>  						typeof(*next),
> list);
>  
> +		// TODO RFC: above we wake up the waitque which
> relies on the fact
> +		// that timeouts have been deactivated. The below
> should never
> +		// reactivate them since the list was empty above.
> Still, should
> +		// we document that?

Regarding this, I re-read the code a few times and conclude that this
is not a problem.


P.

>  		if (next) {
>  			if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>  				     &next->s_fence-
> >scheduled.flags))
> @@ -1324,6 +1337,7 @@ int drm_sched_init(struct drm_gpu_scheduler
> *sched, const struct drm_sched_init_
>  	init_waitqueue_head(&sched->job_scheduled);
>  	INIT_LIST_HEAD(&sched->pending_list);
>  	spin_lock_init(&sched->job_list_lock);
> +	init_waitqueue_head(&sched->pending_list_waitque);
>  	atomic_set(&sched->credit_count, 0);
>  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>  	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> @@ -1331,6 +1345,8 @@ int drm_sched_init(struct drm_gpu_scheduler
> *sched, const struct drm_sched_init_
>  	atomic_set(&sched->_score, 0);
>  	atomic64_set(&sched->job_id_count, 0);
>  	sched->pause_submit = false;
> +	sched->pause_free = false;
> +	sched->cancel_timeout = false;
>  
>  	sched->ready = true;
>  	return 0;
> @@ -1348,6 +1364,40 @@ int drm_sched_init(struct drm_gpu_scheduler
> *sched, const struct drm_sched_init_
>  }
>  EXPORT_SYMBOL(drm_sched_init);
>  
> +/**
> + * drm_sched_submission_and_timeout_stop - stop everything except
> for free_job()
> + * @sched: scheduler instance
> + *
> + * Only needed to cleanly tear down the scheduler in
> drm_sched_fini().
> + */
> +static inline void
> +drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler
> *sched)
> +{
> +	WRITE_ONCE(sched->pause_submit, true);
> +	WRITE_ONCE(sched->cancel_timeout, true);
> +	cancel_work_sync(&sched->work_run_job);
> +	cancel_delayed_work_sync(&sched->work_tdr);
> +}
> +
> +static inline void
> +drm_sched_free_stop(struct drm_gpu_scheduler *sched)
> +{
> +	WRITE_ONCE(sched->pause_free, true);
> +	cancel_work_sync(&sched->work_free_job);
> +}
> +
> +static inline bool
> +drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> +{
> +	bool empty;
> +
> +	spin_lock(&sched->job_list_lock);
> +	empty = list_empty(&sched->pending_list);
> +	spin_unlock(&sched->job_list_lock);
> +
> +	return empty;
> +}
> +
>  /**
>   * drm_sched_fini - Destroy a gpu scheduler
>   *
> @@ -1355,26 +1405,24 @@ EXPORT_SYMBOL(drm_sched_init);
>   *
>   * Tears down and cleans up the scheduler.
>   *
> - * This stops submission of new jobs to the hardware through
> - * drm_sched_backend_ops.run_job(). Consequently,
> drm_sched_backend_ops.free_job()
> - * will not be called for all jobs still in
> drm_gpu_scheduler.pending_list.
> - * There is no solution for this currently. Thus, it is up to the
> driver to make
> - * sure that:
> - *
> - *  a) drm_sched_fini() is only called after for all submitted jobs
> - *     drm_sched_backend_ops.free_job() has been called or that
> - *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> been called
> - *     after drm_sched_fini() ran are freed manually.
> - *
> - * FIXME: Take care of the above problem and prevent this function
> from leaking
> - * the jobs in drm_gpu_scheduler.pending_list under any
> circumstances.
> + * Note that this function blocks until all the fences returned by
> + * &struct drm_sched_backend_ops.run_job have been signalled.
>   */
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
>  	struct drm_sched_entity *s_entity;
>  	int i;
>  
> -	drm_sched_wqueue_stop(sched);
> +	/*
> +	 * Jobs that have neither been scheduled or which have timed
> out are
> +	 * gone by now, but jobs that have been submitted through
> +	 * backend_ops.run_job() and have not yet terminated are
> still pending.
> +	 *
> +	 * Wait for the pending_list to become empty to avoid
> leaking those jobs.
> +	 */
> +	drm_sched_submission_and_timeout_stop(sched);
> +	wait_event(sched->pending_list_waitque,
> drm_sched_no_jobs_pending(sched));
> +	drm_sched_free_stop(sched);
>  
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++)
> {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
> @@ -1471,6 +1519,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
>  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
>  {
>  	WRITE_ONCE(sched->pause_submit, true);
> +	WRITE_ONCE(sched->pause_free, true);
>  	cancel_work_sync(&sched->work_run_job);
>  	cancel_work_sync(&sched->work_free_job);
>  }
> @@ -1488,6 +1537,7 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
>  {
>  	WRITE_ONCE(sched->pause_submit, false);
> +	WRITE_ONCE(sched->pause_free, false);
>  	queue_work(sched->submit_wq, &sched->work_run_job);
>  	queue_work(sched->submit_wq, &sched->work_free_job);
>  }
> diff --git a/include/drm/gpu_scheduler.h
> b/include/drm/gpu_scheduler.h
> index 1a7e377d4cbb..badcf9ea4501 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -29,6 +29,7 @@
>  #include <linux/completion.h>
>  #include <linux/xarray.h>
>  #include <linux/workqueue.h>
> +#include <linux/wait.h>
>  
>  #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
>  
> @@ -533,6 +534,8 @@ struct drm_sched_backend_ops {
>   *            timeout interval is over.
>   * @pending_list: the list of jobs which are currently in the job
> queue.
>   * @job_list_lock: lock to protect the pending_list.
> + * @pending_list_waitque: a waitqueue for drm_sched_fini() to block
> on until all
> + *		          pending jobs have been finished.
>   * @hang_limit: once the hangs by a job crosses this limit then it
> is marked
>   *              guilty and it will no longer be considered for
> scheduling.
>   * @score: score to help loadbalancer pick a idle sched
> @@ -540,6 +543,8 @@ struct drm_sched_backend_ops {
>   * @ready: marks if the underlying HW is ready to work
>   * @free_guilty: A hit to time out handler to free the guilty job.
>   * @pause_submit: pause queuing of @work_run_job on @submit_wq
> + * @pause_free: pause queuing of @work_free_job on @submit_wq
> + * @cancel_timeout: pause queuing of @work_tdr on @submit_wq
>   * @own_submit_wq: scheduler owns allocation of @submit_wq
>   * @dev: system &struct device
>   *
> @@ -562,12 +567,15 @@ struct drm_gpu_scheduler {
>  	struct delayed_work		work_tdr;
>  	struct list_head		pending_list;
>  	spinlock_t			job_list_lock;
> +	wait_queue_head_t		pending_list_waitque;
>  	int				hang_limit;
>  	atomic_t                        *score;
>  	atomic_t                        _score;
>  	bool				ready;
>  	bool				free_guilty;
>  	bool				pause_submit;
> +	bool				pause_free;
> +	bool				cancel_timeout;
>  	bool				own_submit_wq;
>  	struct device			*dev;
>  };



* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17  7:45   ` Philipp Stanner
@ 2025-04-17 11:27     ` Tvrtko Ursulin
  2025-04-17 12:11       ` Danilo Krummrich
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-17 11:27 UTC (permalink / raw)
  To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann
  Cc: dri-devel, nouveau, linux-kernel


On 17/04/2025 08:45, Philipp Stanner wrote:
> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>> drm_sched_fini() can leak jobs under certain circumstances.
>>
>> Warn if that happens.
>>
>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 4 ++++
> 
> I hear a lot of amazed silence for this series ^_^
> 
> If there's no major pushback, my intention is to pull this in once the
> blocking Nouveau-bug has been fixed (by pulling my patch).

It was on my todo list to read it but I failed to get to it due to
various reasons.

I only managed to skim over it and am not quite convinced. But because
I did not have time to think about it very much, my feedback at this
point is not very deep.

On the high level I understand that for nouveau the series is "net
zero", right? No leaks before, no leaks after. What about other drivers?
Which ones have known leaks that could be addressed by implementing this
new callback?

Presumably you would document that the callback should only be
implemented by drivers which are 100% sure the clean up can always be
done reliably? Otherwise an unkillable process on stuck hardware, or on
drivers without good enough reset support, can still happen. (Which
would be worse than a leak on shutdown.)

Have you thought about any observable effects from the userspace point
of view? For example, could something which works now, such as
submitting a job which writes to a shared buffer and then exiting, stop
working after the callback is implemented? I don't see it, because it
would be unreliable even today (it is timing dependent whether the job
is still in the queue or already scheduled at exit time), so I'm just
thinking out loud.

Also, since the cover letter mentions job reference counting as one
idea that was discarded, another related problem is the lifetimes. I
think it would be good to discuss potentially reference counting
"everything", because for instance today I can trivially crash the
kernel with the xe driver*. Probably other drivers too.

Problem exactly is that jobs can outlive the entities and the scheduler, 
while some userspace may have a dma fence reference to the job via sync 
file. This new callback would not solve it for xe, but if everything 
required was reference counted it would.

Back to the series.

On the design level it feels like it adds too much state machine and 
makes things hard to follow/maintain. It would be nice to find a 
solution where we end up with less state machine and not more.

On the low level there are some patch ordering and naming, spelling and 
other small improvements to be made.

But as said at the start, I would need to set aside more time to provide 
better comments and/or ideas.

*)
https://patchwork.freedesktop.org/patch/642709/?series=146211&rev=2

> In the mean time, let me review my own stuff:
> 
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 6b72278c4b72..ae3152beca14 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
>> *sched)
>>   	sched->ready = false;
>>   	kfree(sched->sched_rq);
>>   	sched->sched_rq = NULL;
>> +
>> +	if (!list_empty(&sched->pending_list))
>> +		dev_err(sched->dev, "%s: Tearing down scheduler
>> while jobs are pending!\n",
>> +			__func__);

It isn't fair to add this error since it would out of the blue start 
firing for everyone except nouveau, no? Regardless if there is a leak or
not.

> 
> The "%s" here will be removed since dev_err() alreday prints the
> function name.

It does? But anyway, the function name is redundant and irrelevant and
should not be logged IMO. I would rather prefer we replaced sched->dev
with sched->drm so we could use drm loggers for clarity throughout.

> I find this dev_err() print more useful than the stack a WARN_ON prints
> (telling you about invalid_asm_exec_op or stuff like that).
> 
> Plus, I guess the places where drivers call drm_sched_fini() are very
> well defined and known, so a callstack wouldn't really be useful in the
> first place.

Agreed.

Regards,

Tvrtko

> 
> P.
> 
>>   }
>>   EXPORT_SYMBOL(drm_sched_fini);
>>   
> 



* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17 11:27     ` Tvrtko Ursulin
@ 2025-04-17 12:11       ` Danilo Krummrich
  2025-04-17 14:20         ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-17 12:11 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> 
> On 17/04/2025 08:45, Philipp Stanner wrote:
> > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> 
> Problem exactly is that jobs can outlive the entities and the scheduler,
> while some userspace may have a dma fence reference to the job via sync
> file. This new callback would not solve it for xe, but if everything
> required was reference counted it would.

I think you're mixing up the job and the dma_fence here, if a job outlives the
scheduler, it clearly is a bug, always has been.

AFAIK, Xe reference counts its driver specific job structures *and* the driver
specific scheduler structure, such that drm_sched_fini() won't be called before
all jobs have finished.
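
Roughly, that pattern looks like the sketch below. To be clear, this is
just a generic sketch with made-up names to illustrate the idea, not Xe's
actual code:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side wrapper; every job pins it until it is freed. */
struct my_sched {
	struct drm_gpu_scheduler base;
	struct kref ref;
	struct work_struct release_work;	/* INIT_WORK() at creation time */
};

static void my_sched_release_work(struct work_struct *w)
{
	struct my_sched *s = container_of(w, struct my_sched, release_work);

	/* No job holds a reference anymore, so the pending_list is empty. */
	drm_sched_fini(&s->base);
	kfree(s);
}

static void my_sched_release(struct kref *ref)
{
	struct my_sched *s = container_of(ref, struct my_sched, ref);

	/*
	 * Defer the teardown: the final put may come from the scheduler's
	 * own free_job() work, which must not call drm_sched_fini() itself.
	 */
	schedule_work(&s->release_work);
}

static void my_free_job(struct drm_sched_job *job)
{
	struct my_sched *s = container_of(job->sched, struct my_sched, base);

	/* ... free the driver-specific job ... */
	kref_put(&s->ref, my_sched_release);	/* reference taken at job creation */
}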

The reference counting solution, however, potentially creates subsequent
lifetime problems, which I think have already been discussed in a different
thread. Philipp can probably link in the corresponding discussion.

> On the design level it feels like it adds too much state machine and makes
> things hard to follow/maintain. It would be nice to find a solution where
> we end up with less state machine and not more.

Multiple solutions have been discussed already, e.g. just wait for the pending
list to be empty, reference count the scheduler for every pending job. Those all
had significant downsides, which I don't see with this proposal.

I'm all for better ideas though -- what do you propose?

> 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 6b72278c4b72..ae3152beca14 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> > > *sched)
> > >   	sched->ready = false;
> > >   	kfree(sched->sched_rq);
> > >   	sched->sched_rq = NULL;
> > > +
> > > +	if (!list_empty(&sched->pending_list))
> > > +		dev_err(sched->dev, "%s: Tearing down scheduler
> > > while jobs are pending!\n",
> > > +			__func__);
> 
> It isn't fair to add this error since it would out of the blue start firing
> for everyone except nouveau, no? Regardless if there is a leak or not.

I think it is pretty fair to warn when detecting a guaranteed bug, no?

If drm_sched_fini() is called while jobs are still on the pending_list, they won't
ever be freed, because all workqueues are stopped.


* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17 12:11       ` Danilo Krummrich
@ 2025-04-17 14:20         ` Tvrtko Ursulin
  2025-04-17 14:48           ` Danilo Krummrich
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-17 14:20 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 17/04/2025 13:11, Danilo Krummrich wrote:
> On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
>>
>> On 17/04/2025 08:45, Philipp Stanner wrote:
>>> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>>
>> Problem exactly is that jobs can outlive the entities and the scheduler,
>> while some userspace may have a dma fence reference to the job via sync
>> file. This new callback would not solve it for xe, but if everything
>> required was reference counted it would.
> 
> I think you're mixing up the job and the dma_fence here, if a job outlives the
> scheduler, it clearly is a bug, always has been.
> 
> AFAIK, Xe reference counts its driver specific job structures *and* the driver
> specific scheduler structure, such that drm_sched_fini() won't be called before
> all jobs have finished.

Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini()
until the job has finished. The problem is that an exported dma fence
holds a pointer to the drm_sched_fence (and so oopses in
drm_sched_fence_get_timeline_name() on fence->sched->name) *after* the
job has finished and the driver was free to tear everything down.
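
For reference, the timeline name callback dereferences the scheduler
roughly like this (paraphrased from drm_sched_fence.c), which is where
the oops happens once the scheduler behind fence->sched is gone:

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	struct drm_sched_fence *fence = to_drm_sched_fence(f);

	/* Oopses if the scheduler this fence was created from is already freed. */
	return (const char *)fence->sched->name;
}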

That is what the "fini callback" wouldn't solve but reference counting 
would.

I don't see why any other driver which can export a fence couldn't
suffer from the same problem, and even if I agree this maybe isn't
strictly a scheduler problem, I cannot see it being solved without
involving the scheduler, or in any other way than reference counting.
And if reference counting could solve both problems, then it feels
attractive.

> The reference counting solution, however, potentially creates subsequent
> lifetime problems, which I think have already been discussed in a different
> thread. Philipp can probably link in the corresponding discussion.

I don't doubt it causes problems but maybe there will be no other way 
than to tackle them.

>> On the design level it feels like it adds too much state machine and makes
>> things hard to follow/maintain. It would be nice to find a solutiuon where
>> we end up with less state machine and not more.
> 
> Multiple solutions have been discussed already, e.g. just wait for the pending
> list to be empty, reference count the scheduler for every pending job. Those all
> had significant downsides, which I don't see with this proposal.
> 
> I'm all for better ideas though -- what do you propose?

I think we need to brainstorm both issues and see if there is a solution 
which solves them both, with bonus points for being elegant.

Use after free is IMO an even worse problem, so if reference counting is
the only way then let's try that. But I will wait for those links to
catch up on the past discussions.

>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 6b72278c4b72..ae3152beca14 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
>>>> *sched)
>>>>    	sched->ready = false;
>>>>    	kfree(sched->sched_rq);
>>>>    	sched->sched_rq = NULL;
>>>> +
>>>> +	if (!list_empty(&sched->pending_list))
>>>> +		dev_err(sched->dev, "%s: Tearing down scheduler
>>>> while jobs are pending!\n",
>>>> +			__func__);
>>
>> It isn't fair to add this error since it would out of the blue start firing
>> for everyone except nouveau, no? Regardless if there is a leak or not.
> 
> I think it is pretty fair to warn when detecting a guaranteed bug, no?
> 
> If drm_sched_fini() is called while jobs are still on the pending_list, they won't
> ever be freed, because all workqueues are stopped.

Is it a guaranteed bug for drivers that are aware of the drm_sched_fini()
limitation and clean up after themselves?

In other words, if you apply the series up to here, would it trigger for
nouveau? Reportedly it triggers for the mock scheduler, which also has no
leak.

Also, I asked in my initial reply if we have a list of which of the
current drivers suffer from memory leaks. Is it all of them, only some,
etc.?

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17 14:20         ` Tvrtko Ursulin
@ 2025-04-17 14:48           ` Danilo Krummrich
  2025-04-17 16:08             ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-17 14:48 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
> 
> On 17/04/2025 13:11, Danilo Krummrich wrote:
> > On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 17/04/2025 08:45, Philipp Stanner wrote:
> > > > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> > > 
> > > Problem exactly is that jobs can outlive the entities and the scheduler,
> > > while some userspace may have a dma fence reference to the job via sync
> > > file. This new callback would not solve it for xe, but if everything
> > > required was reference counted it would.
> > 
> > I think you're mixing up the job and the dma_fence here, if a job outlives the
> > scheduler, it clearly is a bug, always has been.
> > 
> > AFAIK, Xe reference counts it's driver specific job structures *and* the driver
> > specific scheduler structure, such that drm_sched_fini() won't be called before
> > all jobs have finished.
> 
> Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini until
> the job is not finished. Problem is exported dma fence holds the pointer to
> drm_sched_fence (and so oopses in drm_sched_fence_get_timeline_name on
> fence->sched->name) *after* job had finished and driver was free to tear
> everything down.

Well, that's a bug in drm_sched_fence then and independent of the other topic.
Once the finished fence in a struct drm_sched_fence has been signaled, it must
live independently of the scheduler.

The lifetime of the drm_sched_fence is entirely independent of the scheduler
itself, as you correctly point out.

Starting to reference count things to keep the whole scheduler etc. alive as
long as the drm_sched_fence lives is not the correct solution.

> > Multiple solutions have been discussed already, e.g. just wait for the pending
> > list to be empty, reference count the scheduler for every pending job. Those all
> > had significant downsides, which I don't see with this proposal.
> > 
> > I'm all for better ideas though -- what do you propose?
> 
> I think we need to brainstorm both issues and see if there is a solution
> which solves them both, with bonus points for being elegant.

The problems are not related. As mentioned above, once signaled, a
drm_sched_fence must not depend on the scheduler any longer.

> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> > > > > *sched)
> > > > >    	sched->ready = false;
> > > > >    	kfree(sched->sched_rq);
> > > > >    	sched->sched_rq = NULL;
> > > > > +
> > > > > +	if (!list_empty(&sched->pending_list))
> > > > > +		dev_err(sched->dev, "%s: Tearing down scheduler
> > > > > while jobs are pending!\n",
> > > > > +			__func__);
> > > 
> > > It isn't fair to add this error since it would out of the blue start firing
> > > for everyone expect nouveau, no? Regardless if there is a leak or not.
> > 
> > I think it is pretty fair to warn when detecting a guaranteed bug, no?
> > 
> > If drm_sched_fini() is call while jobs are still on the pending_list, they won't
> > ever be freed, because all workqueues are stopped.
> 
> Is it a guaranteed bug for drivers are aware of the drm_sched_fini()
> limitation and are cleaning up upon themselves?

How could a driver clean up by itself (unless the driver holds its own list of
pending jobs)?

Once a job is in flight (i.e. it's on the pending_list) we must guarantee that
free_job() is called by the scheduler, which it can't do if we call
drm_sched_fini() before the pending_list is empty.

> In other words if you apply the series up to here would it trigger for
> nouveau?

No, because nouveau does something very stupid, i.e. replicate the pending_list.

> Reportedly it triggers for the mock scheduler which also has no
> leak.

That sounds impossible. How do you ensure you do *not* leak memory when you tear
down the scheduler while it still has pending jobs? Or in other words, who calls
free_job() if not the scheduler itself?

> Also, I asked in my initial reply if we have a list of which of the current
> drivers suffer from memory leaks. Is it all or some etc.

Not all, but quite a few I think. The last time I looked (which is about a year
ago) amdgpu, for instance, could leak memory when you unbind the driver while
enough jobs are in flight.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17 14:48           ` Danilo Krummrich
@ 2025-04-17 16:08             ` Tvrtko Ursulin
  2025-04-17 17:07               ` Danilo Krummrich
  2025-04-22  6:06               ` Philipp Stanner
  0 siblings, 2 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-17 16:08 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 17/04/2025 15:48, Danilo Krummrich wrote:
> On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
>>
>> On 17/04/2025 13:11, Danilo Krummrich wrote:
>>> On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 17/04/2025 08:45, Philipp Stanner wrote:
>>>>> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>>>>
>>>> Problem exactly is that jobs can outlive the entities and the scheduler,
>>>> while some userspace may have a dma fence reference to the job via sync
>>>> file. This new callback would not solve it for xe, but if everything
>>>> required was reference counted it would.
>>>
>>> I think you're mixing up the job and the dma_fence here, if a job outlives the
>>> scheduler, it clearly is a bug, always has been.
>>>
>>> AFAIK, Xe reference counts it's driver specific job structures *and* the driver
>>> specific scheduler structure, such that drm_sched_fini() won't be called before
>>> all jobs have finished.
>>
>> Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini until
>> the job is not finished. Problem is exported dma fence holds the pointer to
>> drm_sched_fence (and so oopses in drm_sched_fence_get_timeline_name on
>> fence->sched->name) *after* job had finished and driver was free to tear
>> everything down.
> 
> Well, that's a bug in drm_sched_fence then and independent from the other topic.
> Once the finished fence in a struct drm_sched_fence has been signaled it must
> live independent of the scheduler.
> 
> The lifetime of the drm_sched_fence is entirely independent from the scheduler
> itself, as you correctly point out.

The connection (re. independent or not) I was making was: *if* drm_sched
were reference counted, would that satisfy both the requirement to keep
drm_sched_fence_get_timeline_name() working and allow a different
flavour of the memory leak fix?

I agree drm_sched_fence_get_timeline_name() can also be fixed by
removing the fence->sched dereference and losing the (pretty) name.
Historically there has been a lot of trouble with those names, so maybe
that would be acceptable.
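
As a rough sketch of that direction only (the cached timeline_name
field is made up here; it would have to be added to struct
drm_sched_fence and filled from sched->name at fence init time):

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	struct drm_sched_fence *fence = to_drm_sched_fence(f);

	/*
	 * Hypothetical: fence->timeline_name is a copy made at init time,
	 * e.g. strscpy(fence->timeline_name, sched->name, sizeof(...)).
	 * No fence->sched dereference, so the fence can safely outlive
	 * the scheduler.
	 */
	return fence->timeline_name;
}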

Revoking s_fence->sched on job completion as an alternative does not 
sound feasible.

To further complicate matters, I suspect rmmod gpu-sched.ko is also 
something which would break exported fences since that would remove the 
fence ops. But that is solvable by module_get/put().
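
To illustrate the module_get/put() idea (helper names made up, not
existing code): every sched fence which can escape to userspace would
pin gpu-sched.ko and unpin it from its fence release path, something
like:

#include <linux/module.h>

static bool drm_sched_fence_pin_module(void)
{
	/* Called when the fence is created/exported. */
	return try_module_get(THIS_MODULE);
}

static void drm_sched_fence_unpin_module(void)
{
	/* Called from the fence release path. */
	module_put(THIS_MODULE);
}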

> Starting to reference count things to keep the whole scheduler etc. alive as
> long as the drm_sched_fence lives is not the correct solution.

To catch up on why, it would be helpful if you could dig out the links
to the past discussions.

I repeat that there is a lot of attractiveness to reference counting.
The already mentioned memory leak, the s_fence oops, and also not having
to clear job->entity could be useful for things like tracking per-entity
submission stats (imagine CFS-like scheduling, or a generic scheduling
DRM cgroup controller). So it would be good for me to hear what pitfalls
were identified in this space.

>>> Multiple solutions have been discussed already, e.g. just wait for the pending
>>> list to be empty, reference count the scheduler for every pending job. Those all
>>> had significant downsides, which I don't see with this proposal.
>>>
>>> I'm all for better ideas though -- what do you propose?
>>
>> I think we need to brainstorm both issues and see if there is a solution
>> which solves them both, with bonus points for being elegant.
> 
> The problems are not related. As mentioned above, once signaled a
> drm_sched_fence must not depend on the scheduler any longer.
> 
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 6b72278c4b72..ae3152beca14 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
>>>>>> *sched)
>>>>>>     	sched->ready = false;
>>>>>>     	kfree(sched->sched_rq);
>>>>>>     	sched->sched_rq = NULL;
>>>>>> +
>>>>>> +	if (!list_empty(&sched->pending_list))
>>>>>> +		dev_err(sched->dev, "%s: Tearing down scheduler
>>>>>> while jobs are pending!\n",
>>>>>> +			__func__);
>>>>
>>>> It isn't fair to add this error since it would out of the blue start firing
>>>> for everyone expect nouveau, no? Regardless if there is a leak or not.
>>>
>>> I think it is pretty fair to warn when detecting a guaranteed bug, no?
>>>
>>> If drm_sched_fini() is call while jobs are still on the pending_list, they won't
>>> ever be freed, because all workqueues are stopped.
>>
>> Is it a guaranteed bug for drivers are aware of the drm_sched_fini()
>> limitation and are cleaning up upon themselves?
> 
> How could a driver clean up on itself (unless the driver holds its own list of
> pending jobs)?
> 
> Once a job is in flight (i.e. it's on the pending_list) we must guarantee that
> free_job() is called by the scheduler, which it can't do if we call
> drm_sched_fini() before the pending_list is empty.
> 
>> In other words if you apply the series up to here would it trigger for
>> nouveau?
> 
> No, because nouveau does something very stupid, i.e. replicate the pending_list.

Ah okay, I see it now: it waits for all jobs to finish before calling
drm_sched_fini(). For some reason I did not think it was doing that,
given the cover letter starts with how that is a big no-no.

>> Reportedly it triggers for the mock scheduler which also has no
>> leak.
> 
> That sounds impossible. How do you ensure you do *not* leak memory when you tear
> down the scheduler while it still has pending jobs? Or in other words, who calls
> free_job() if not the scheduler itself?

Well the cover letter says it triggers so it is possible. :)

The mock scheduler also tracks the pending jobs itself, but unlike
nouveau it does not wait for the jobs to finish and for the free worker
to process them all; instead, having stopped the "hw" backend, it
cancels them and calls the free_job vfunc directly.

Going back to the topic of this series, if we go with a solution along
the lines of the proposed one, I wonder if it would be doable without
mandating that drivers keep a list parallel to pending_list. Instead,
have a vfunc the DRM scheduler would call to cancel one job at a time
from *its own* pending list. It would go nicely together with
prepare/run/timedout/free.
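
Very roughly something like this (cancel_job is a name I just made up,
locking and the rest of the existing teardown omitted):

void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job, *tmp;

	/* Stop submission and the free worker first, as today. */
	drm_sched_wqueue_stop(sched);

	/* Then cancel and free straight from the scheduler's own list. */
	list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
		sched->ops->cancel_job(job);	/* hypothetical new vfunc */
		list_del_init(&job->list);
		sched->ops->free_job(job);
	}

	/* ... rest of the existing teardown ... */
}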

Would it allow getting rid of the new state machinery and just 
cancelling and freeing in one go directly from drm_sched_fini()?

Regards,

Tvrtko

>> Also, I asked in my initial reply if we have a list of which of the current
>> drivers suffer from memory leaks. Is it all or some etc.
> 
> Not all, but quite some I think. The last time I looked (which is about a year
> ago) amdgpu for instance could leak memory when you unbind the driver while
> enough jobs are in flight.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17 16:08             ` Tvrtko Ursulin
@ 2025-04-17 17:07               ` Danilo Krummrich
  2025-04-22  6:06               ` Philipp Stanner
  1 sibling, 0 replies; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-17 17:07 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Thu, Apr 17, 2025 at 05:08:12PM +0100, Tvrtko Ursulin wrote:
> To catch up on why if you could dig out the links to past discussions it
> would be helpful.

I can't look it up currently, sorry. That's why I said Philipp will loop you in
once he's back. :)

> I repeat how there is a lot of attractiveness to reference counting. Already
> mentioned memory leak, s_fence oops, and also not having to clear
> job->entity could be useful for things like tracking per entity submission
> stats (imagine CFS like scheduling, generic scheduling DRM cgroup
> controller). So it would be good for me to hear what pitfalls were
> identified in this space.

<snip>

> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler
> > > > > > > *sched)
> > > > > > >     	sched->ready = false;
> > > > > > >     	kfree(sched->sched_rq);
> > > > > > >     	sched->sched_rq = NULL;
> > > > > > > +
> > > > > > > +	if (!list_empty(&sched->pending_list))
> > > > > > > +		dev_err(sched->dev, "%s: Tearing down scheduler
> > > > > > > while jobs are pending!\n",
> > > > > > > +			__func__);
> > > > > 
> > > > > It isn't fair to add this error since it would out of the blue start firing
> > > > > for everyone expect nouveau, no? Regardless if there is a leak or not.
> > > > 
> > > > I think it is pretty fair to warn when detecting a guaranteed bug, no?
> > > > 
> > > > If drm_sched_fini() is call while jobs are still on the pending_list, they won't
> > > > ever be freed, because all workqueues are stopped.
> > > 
> > > Is it a guaranteed bug for drivers are aware of the drm_sched_fini()
> > > limitation and are cleaning up upon themselves?
> > 
> > How could a driver clean up on itself (unless the driver holds its own list of
> > pending jobs)?
> > 
> > Once a job is in flight (i.e. it's on the pending_list) we must guarantee that
> > free_job() is called by the scheduler, which it can't do if we call
> > drm_sched_fini() before the pending_list is empty.
> > 
> > > In other words if you apply the series up to here would it trigger for
> > > nouveau?
> > 
> > No, because nouveau does something very stupid, i.e. replicate the pending_list.
> 
> Ah okay I see it now, it waits for all jobs to finish before calling
> drm_sched_fini(). For some reason I did not think it was doing that given
> the cover letter starts with how that is a big no-no.
> 
> > > Reportedly it triggers for the mock scheduler which also has no
> > > leak.
> > 
> > That sounds impossible. How do you ensure you do *not* leak memory when you tear
> > down the scheduler while it still has pending jobs? Or in other words, who calls
> > free_job() if not the scheduler itself?
> 
> Well the cover letter says it triggers so it is possible. :)

I mean it should be impossible to have jobs on the pending_list when calling
drm_sched_fini() and yet not have memory leaks, unless you let the driver do
weird things, such as peeking into implementation details of the scheduler.

> Mock scheduler also tracks the pending jobs itself, but different from
> nouveau it does not wait for jobs to finish and free worker to process them
> all, but having stopped the "hw" backend it cancels them and calls the
> free_job vfunc directly.

That seems very wrong to me. This is exactly what I mean by the driver peeking
into implementation details.

If the API of a component requires knowing about and modifying the internals
of that component, it's pretty much broken and must be fixed.

> Going back to the topic of this series, if we go with a solution along the
> lines of the proposed, I wonder if it would be doable without mandating that
> drivers keep a list parallel to pending_list. Instead have a vfunc DRM
> scheduler would call to cancel job at a time from *its* pending list. It
> would go nicely together with prepare/run/timedout/free.

That's pretty much what this series does, no? With the new callback the driver
is supposed to kill the corresponding fence context, which signals all
associated fences and hence the pending list will clear subsequently.

> Would it allow getting rid of the new state machinery and just cancelling
> and freeing in one go directly from drm_sched_fini()?

Again, I think that's what it does, unless I misunderstand you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-17 16:08             ` Tvrtko Ursulin
  2025-04-17 17:07               ` Danilo Krummrich
@ 2025-04-22  6:06               ` Philipp Stanner
  2025-04-22 10:39                 ` Tvrtko Ursulin
  1 sibling, 1 reply; 31+ messages in thread
From: Philipp Stanner @ 2025-04-22  6:06 UTC (permalink / raw)
  To: Tvrtko Ursulin, Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Thu, 2025-04-17 at 17:08 +0100, Tvrtko Ursulin wrote:
> 
> On 17/04/2025 15:48, Danilo Krummrich wrote:
> > On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 17/04/2025 13:11, Danilo Krummrich wrote:
> > > > On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 17/04/2025 08:45, Philipp Stanner wrote:
> > > > > > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> > > > > 
> > > > > Problem exactly is that jobs can outlive the entities and the
> > > > > scheduler,
> > > > > while some userspace may have a dma fence reference to the
> > > > > job via sync
> > > > > file. This new callback would not solve it for xe, but if
> > > > > everything
> > > > > required was reference counted it would.
> > > > 
> > > > I think you're mixing up the job and the dma_fence here, if a
> > > > job outlives the
> > > > scheduler, it clearly is a bug, always has been.
> > > > 
> > > > AFAIK, Xe reference counts it's driver specific job structures
> > > > *and* the driver
> > > > specific scheduler structure, such that drm_sched_fini() won't
> > > > be called before
> > > > all jobs have finished.
> > > 
> > > Yes, sorry, dma fence. But it is not enough to postpone
> > > drm_sched_fini until
> > > the job is not finished. Problem is exported dma fence holds the
> > > pointer to
> > > drm_sched_fence (and so oopses in
> > > drm_sched_fence_get_timeline_name on
> > > fence->sched->name) *after* job had finished and driver was free
> > > to tear
> > > everything down.
> > 
> > Well, that's a bug in drm_sched_fence then and independent from the
> > other topic.
> > Once the finished fence in a struct drm_sched_fence has been
> > signaled it must
> > live independent of the scheduler.
> > 
> > The lifetime of the drm_sched_fence is entirely independent from
> > the scheduler
> > itself, as you correctly point out.
> 
> Connection (re. independent or not) I made was *if* drm_sched would
> be 
> reference counted, would that satisfy both the requirement to keep 
> working drm_sched_fence_get_timeline_name and to allow a different 
> flavour of the memory leak fix.
> 
> I agree drm_sched_fence_get_timeline_name can also be fixed by
> removing 
> the fence->sched dereference and losing the (pretty) name.
> Historically 
> there has been a lot of trouble with those names so maybe that would
> be 
> acceptable.
> 
> Revoking s_fence->sched on job completion as an alternative does not 
> sound feasible.
> 
> To further complicate matters, I suspect rmmod gpu-sched.ko is also 
> something which would break exported fences since that would remove
> the 
> fence ops. But that is solvable by module_get/put().

Is it a common kernel policy to be immune against crazy people just
calling rmmod on central, shared kernel infrastructure?

We cannot protect people from everything, especially from themselves.

> 
> > Starting to reference count things to keep the whole scheduler etc.
> > alive as
> > long as the drm_sched_fence lives is not the correct solution.
> 
> To catch up on why if you could dig out the links to past discussions
> it 
> would be helpful.
> 
> I repeat how there is a lot of attractiveness to reference counting. 
> Already mentioned memory leak, s_fence oops, and also not having to 
> clear job->entity could be useful for things like tracking per entity
> submission stats (imagine CFS like scheduling, generic scheduling DRM
> cgroup controller). So it would be good for me to hear what pitfalls 
> were identified in this space.

Reference counting does _appear_ attractive, but our (mostly internal)
attempts and discussions revealed that it's not feasible.

There was an RFC by me about it last year, but it was incomplete and
few people reacted:

https://lore.kernel.org/dri-devel/20240903094531.29893-2-pstanner@redhat.com/

When you try to implement refcounting, you quickly discover that it's
not enough if submitted jobs refcount the scheduler.

If the scheduler is refcounted, this means that it can now outlive the
driver. This means that it can continue calling into the driver with
run_job(), free_job(), timedout_job() and so on.

Since this would fire 500 million UAFs, you would then have to guard
*all* drivers' callbacks in some way against the driver being unloaded.
You might even end up having to refcount things like entire modules,
depending on the driver.
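
Just to illustrate the kind of boilerplate every single driver would
then need (the foo_* names are completely made up):

static void foo_free_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = container_of(sched_job, struct foo_job, base);

	/*
	 * Hypothetical guard: without something like this, the refcounted
	 * scheduler might call back into us after the foo driver / device
	 * state is already gone -> use-after-free.
	 */
	if (foo_device_tryget(job->fdev)) {
		foo_job_release_hw_resources(job);
		foo_device_put(job->fdev);
	}

	drm_sched_job_cleanup(sched_job);
	kfree(job);
}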


So effectively, with refcounting, you would fix a problem in central
infrastructure (the scheduler) in all of its users (the drivers);
therefore, you would not fix the problem at all, but just work around
the scheduler being broken in the drivers. IOW, everything would be
exactly as it is now, where everyone works around the problem in their
own way. Nouveau uses its redundant pending list, Xe refcounts and
amdgpu does…… IDK, things.


Another problem is the kernel workflow and kernel politics. The
solution I propose here allows for phasing the problem out, first in
Nouveau, then step by step in the other drivers. It's the only
minimalist, backward-compatible solution I can see.


That said, if you have the capacity to look deeper into a refcount
solution, feel free to go for it. But you'd have to find a way to solve
it in a centralized manner, otherwise we could just leave everything
be.


> 
> > > > Multiple solutions have been discussed already, e.g. just wait
> > > > for the pending
> > > > list to be empty, reference count the scheduler for every
> > > > pending job. Those all
> > > > had significant downsides, which I don't see with this
> > > > proposal.
> > > > 
> > > > I'm all for better ideas though -- what do you propose?
> > > 
> > > I think we need to brainstorm both issues and see if there is a
> > > solution
> > > which solves them both, with bonus points for being elegant.
> > 
> > The problems are not related. As mentioned above, once signaled a
> > drm_sched_fence must not depend on the scheduler any longer.
> > 
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct
> > > > > > > drm_gpu_scheduler
> > > > > > > *sched)
> > > > > > >      sched->ready = false;
> > > > > > >      kfree(sched->sched_rq);
> > > > > > >      sched->sched_rq = NULL;
> > > > > > > +
> > > > > > > + if (!list_empty(&sched->pending_list))
> > > > > > > + dev_err(sched->dev, "%s: Tearing down scheduler
> > > > > > > while jobs are pending!\n",
> > > > > > > + __func__);
> > > > > 
> > > > > It isn't fair to add this error since it would out of the
> > > > > blue start firing
> > > > > for everyone expect nouveau, no? Regardless if there is a
> > > > > leak or not.
> > > > 
> > > > I think it is pretty fair to warn when detecting a guaranteed
> > > > bug, no?
> > > > 
> > > > If drm_sched_fini() is call while jobs are still on the
> > > > pending_list, they won't
> > > > ever be freed, because all workqueues are stopped.
> > > 
> > > Is it a guaranteed bug for drivers are aware of the
> > > drm_sched_fini()
> > > limitation and are cleaning up upon themselves?
> > 
> > How could a driver clean up on itself (unless the driver holds its
> > own list of
> > pending jobs)?
> > 
> > Once a job is in flight (i.e. it's on the pending_list) we must
> > guarantee that
> > free_job() is called by the scheduler, which it can't do if we call
> > drm_sched_fini() before the pending_list is empty.
> > 
> > > In other words if you apply the series up to here would it
> > > trigger for
> > > nouveau?
> > 
> > No, because nouveau does something very stupid, i.e. replicate the
> > pending_list.
> 
> Ah okay I see it now, it waits for all jobs to finish before calling 
> drm_sched_fini(). For some reason I did not think it was doing that 
> given the cover letter starts with how that is a big no-no.
> 
> > > Reportedly it triggers for the mock scheduler which also has no
> > > leak.
> > 
> > That sounds impossible. How do you ensure you do *not* leak memory
> > when you tear
> > down the scheduler while it still has pending jobs? Or in other
> > words, who calls
> > free_job() if not the scheduler itself?
> 
> Well the cover letter says it triggers so it is possible. :)

Where does it say that?

I mean, the warning would trigger if a driver is leaking jobs, which is
clearly a bug (and since a list is leaked, it might then even be a
false-negative in kmemleak).

It's actually quite simple:
   1. jobs are being freed by free_job()
   2. If you call drm_sched_fini(), the work item for free_job() gets
      deactivated.
   3. Depending on the load (race), you leak the jobs.

That's not that much of a problem for hardware schedulers, but firmware
schedulers, which tear down scheduler instances all the time, could leak
quite a significant amount of memory over time.
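
For illustration, the typical firmware-scheduler pattern looks roughly
like this (foo_* is made up):

/* One scheduler per context, torn down whenever the context dies. */
static void foo_context_destroy(struct foo_context *ctx)
{
	drm_sched_entity_destroy(&ctx->entity);

	/*
	 * If jobs are still on the pending_list here, free_job() will
	 * never be called for them -> they leak, once per destroyed
	 * context.
	 */
	drm_sched_fini(&ctx->sched);
	kfree(ctx);
}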

I can't see why a warning print would ever be bad there. Even Christian
wants it.

P.

> 
> Mock scheduler also tracks the pending jobs itself, but different
> from 
> nouveau it does not wait for jobs to finish and free worker to
> process 
> them all, but having stopped the "hw" backend it cancels them and
> calls 
> the free_job vfunc directly.
> 
> Going back to the topic of this series, if we go with a solution
> along 
> the lines of the proposed, I wonder if it would be doable without 
> mandating that drivers keep a list parallel to pending_list. Instead 
> have a vfunc DRM scheduler would call to cancel job at a time from
> *its* 
> pending list. It would go nicely together with
> prepare/run/timedout/free.
> 
> Would it allow getting rid of the new state machinery and just 
> cancelling and freeing in one go directly from drm_sched_fini()?
> 
> Regards,
> 
> Tvrtko
> 
> > > Also, I asked in my initial reply if we have a list of which of
> > > the current
> > > drivers suffer from memory leaks. Is it all or some etc.
> > 
> > Not all, but quite some I think. The last time I looked (which is
> > about a year
> > ago) amdgpu for instance could leak memory when you unbind the
> > driver while
> > enough jobs are in flight.
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22  6:06               ` Philipp Stanner
@ 2025-04-22 10:39                 ` Tvrtko Ursulin
  2025-04-22 11:13                   ` Danilo Krummrich
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-22 10:39 UTC (permalink / raw)
  To: phasta, Danilo Krummrich
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 22/04/2025 07:06, Philipp Stanner wrote:
> On Thu, 2025-04-17 at 17:08 +0100, Tvrtko Ursulin wrote:
>>
>> On 17/04/2025 15:48, Danilo Krummrich wrote:
>>> On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 17/04/2025 13:11, Danilo Krummrich wrote:
>>>>> On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 17/04/2025 08:45, Philipp Stanner wrote:
>>>>>>> On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
>>>>>>
>>>>>> Problem exactly is that jobs can outlive the entities and the
>>>>>> scheduler,
>>>>>> while some userspace may have a dma fence reference to the
>>>>>> job via sync
>>>>>> file. This new callback would not solve it for xe, but if
>>>>>> everything
>>>>>> required was reference counted it would.
>>>>>
>>>>> I think you're mixing up the job and the dma_fence here, if a
>>>>> job outlives the
>>>>> scheduler, it clearly is a bug, always has been.
>>>>>
>>>>> AFAIK, Xe reference counts it's driver specific job structures
>>>>> *and* the driver
>>>>> specific scheduler structure, such that drm_sched_fini() won't
>>>>> be called before
>>>>> all jobs have finished.
>>>>
>>>> Yes, sorry, dma fence. But it is not enough to postpone
>>>> drm_sched_fini until
>>>> the job is not finished. Problem is exported dma fence holds the
>>>> pointer to
>>>> drm_sched_fence (and so oopses in
>>>> drm_sched_fence_get_timeline_name on
>>>> fence->sched->name) *after* job had finished and driver was free
>>>> to tear
>>>> everything down.
>>>
>>> Well, that's a bug in drm_sched_fence then and independent from the
>>> other topic.
>>> Once the finished fence in a struct drm_sched_fence has been
>>> signaled it must
>>> live independent of the scheduler.
>>>
>>> The lifetime of the drm_sched_fence is entirely independent from
>>> the scheduler
>>> itself, as you correctly point out.
>>
>> Connection (re. independent or not) I made was *if* drm_sched would
>> be
>> reference counted, would that satisfy both the requirement to keep
>> working drm_sched_fence_get_timeline_name and to allow a different
>> flavour of the memory leak fix.
>>
>> I agree drm_sched_fence_get_timeline_name can also be fixed by
>> removing
>> the fence->sched dereference and losing the (pretty) name.
>> Historically
>> there has been a lot of trouble with those names so maybe that would
>> be
>> acceptable.
>>
>> Revoking s_fence->sched on job completion as an alternative does not
>> sound feasible.
>>
>> To further complicate matters, I suspect rmmod gpu-sched.ko is also
>> something which would break exported fences since that would remove
>> the
>> fence ops. But that is solvable by module_get/put().
> 
> Is it a common kernel policy to be immune against crazy people just
> calling rmmod on central, shared kernel infrastructure?
> 
> We cannot protect people from everything, especially from themselves.

I would say if DRM supports driver unbind, and it does, then we need to
decide what to do about module unload. Otherwise it remains possible and
allows bad things to happen.

Even if the solution might be to just permanently raise the module
refcount, or something like that, it does not feel valid to dismiss the
discussion straight away. Hence I have sent
https://lore.kernel.org/dri-devel/20250418164246.72426-1-tvrtko.ursulin@igalia.com/
to discuss this topic separately.

>>> Starting to reference count things to keep the whole scheduler etc.
>>> alive as
>>> long as the drm_sched_fence lives is not the correct solution.
>>
>> To catch up on why if you could dig out the links to past discussions
>> it
>> would be helpful.
>>
>> I repeat how there is a lot of attractiveness to reference counting.
>> Already mentioned memory leak, s_fence oops, and also not having to
>> clear job->entity could be useful for things like tracking per entity
>> submission stats (imagine CFS like scheduling, generic scheduling DRM
>> cgroup controller). So it would be good for me to hear what pitfalls
>> were identified in this space.
> 
> Reference counting does _appear_ attractive, but our (mostly internal)
> attempts and discussions revealed that it's not feasable.
> 
> There was an RFC by me about it last year, but it was incomplete and
> few people reacted:
> 
> https://lore.kernel.org/dri-devel/20240903094531.29893-2-pstanner@redhat.com/
> 
> When you try to implement refcounting, you quickly discover that it's
> not enough if submitted jobs refcount the scheduler.

Right, I did imply in my initial reply that "everything" would need to 
be reference counted.

> If the scheduler is refcounted, this means that it can now outlive the
> driver. This means that it can continue calling into the driver with
> run_job(), free_job(), timedout_job() and so on.
> 
> Since this would fire 500 million UAFs, you, hence, now would have to
> guard *all* drivers' callbacks in some way against the driver being
> unloaded. You might even end up having to refcount thinks like entire
> modules, depending on the driver.
> 
> 
> So effectively, with refcounting, you would fix a problem in central
> infrastructure (the scheduler) in all users (the driver) – therefore,
> you would not fix the problem at all, but just work around the
> scheduler being broken in the drivers. IOW, everything would be exactly
> as it is now, where everyone works around the problem in their own way.
> Nouveau uses its redundant pending list, Xe refcounts and amdgpu does……
> IDK, things.

Xe _fails_ to handle it, as demonstrated in the above linked RFC. I
don't think any driver can actually handle it on its own. Hence what I
am discussing is the complete opposite of "not fixing the problem at
all": it is about what, if anything, we can do to fix it robustly.

> Another problem is the kernel workflow and kernel politics. The
> solution I propose here allows for phasing the problem out, first in
> Nouveau, then step by step others. It's the only minimalist backward-
> compatible solution I can see.

These are two separate things: the memory leaks in this series, and the
reference counting angle as a potential fix for _both_ the memory leaks
_and_ the use-after-free problems.

> That said, if you have the capacity to look deeper into a refcount
> solution, feel free to go for it. But you'd have to find a way to solve
> it in a centralized manner, otherwise we could just leave everything
> be.

I will leave discussion for my linked RFC.

>>>>> Multiple solutions have been discussed already, e.g. just wait
>>>>> for the pending
>>>>> list to be empty, reference count the scheduler for every
>>>>> pending job. Those all
>>>>> had significant downsides, which I don't see with this
>>>>> proposal.
>>>>>
>>>>> I'm all for better ideas though -- what do you propose?
>>>>
>>>> I think we need to brainstorm both issues and see if there is a
>>>> solution
>>>> which solves them both, with bonus points for being elegant.
>>>
>>> The problems are not related. As mentioned above, once signaled a
>>> drm_sched_fence must not depend on the scheduler any longer.
>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 6b72278c4b72..ae3152beca14 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct
>>>>>>>> drm_gpu_scheduler
>>>>>>>> *sched)
>>>>>>>>       sched->ready = false;
>>>>>>>>       kfree(sched->sched_rq);
>>>>>>>>       sched->sched_rq = NULL;
>>>>>>>> +
>>>>>>>> + if (!list_empty(&sched->pending_list))
>>>>>>>> + dev_err(sched->dev, "%s: Tearing down scheduler
>>>>>>>> while jobs are pending!\n",
>>>>>>>> + __func__);
>>>>>>
>>>>>> It isn't fair to add this error since it would out of the
>>>>>> blue start firing
>>>>>> for everyone expect nouveau, no? Regardless if there is a
>>>>>> leak or not.
>>>>>
>>>>> I think it is pretty fair to warn when detecting a guaranteed
>>>>> bug, no?
>>>>>
>>>>> If drm_sched_fini() is call while jobs are still on the
>>>>> pending_list, they won't
>>>>> ever be freed, because all workqueues are stopped.
>>>>
>>>> Is it a guaranteed bug for drivers are aware of the
>>>> drm_sched_fini()
>>>> limitation and are cleaning up upon themselves?
>>>
>>> How could a driver clean up on itself (unless the driver holds its
>>> own list of
>>> pending jobs)?
>>>
>>> Once a job is in flight (i.e. it's on the pending_list) we must
>>> guarantee that
>>> free_job() is called by the scheduler, which it can't do if we call
>>> drm_sched_fini() before the pending_list is empty.
>>>
>>>> In other words if you apply the series up to here would it
>>>> trigger for
>>>> nouveau?
>>>
>>> No, because nouveau does something very stupid, i.e. replicate the
>>> pending_list.
>>
>> Ah okay I see it now, it waits for all jobs to finish before calling
>> drm_sched_fini(). For some reason I did not think it was doing that
>> given the cover letter starts with how that is a big no-no.
>>
>>>> Reportedly it triggers for the mock scheduler which also has no
>>>> leak.
>>>
>>> That sounds impossible. How do you ensure you do *not* leak memory
>>> when you tear
>>> down the scheduler while it still has pending jobs? Or in other
>>> words, who calls
>>> free_job() if not the scheduler itself?
>>
>> Well the cover letter says it triggers so it is possible. :)
> 
> Where does it say that?

"""
Tvrtko's unit tests also run as expected (except for the new warning
print in patch 3), which is not surprising since they don't provide the
callback.
"""

> I mean, the warning would trigger if a driver is leaking jobs, which is
> clearly a bug (and since a list is leaked, it might then even be a
> false-negative in kmemleak).
> 
> It's actually quite simple:
>     1. jobs are being freed by free_job()
>     2. If you call drm_sched_fini(), the work item for free_job() gets
>        deactivated.
>     3. Depending on the load (race), you leak the jobs.
> 
> That's not that much of a problem for hardware schedulers, but firmware
> schedulers who tear down scheduler instances all the time could leak
> quite some amount of memory over time.
> 
> I can't see why a warning print would ever be bad there. Even Christian
> wants it.

The question I raised is whether there are other drivers which manage
to clean up everything correctly (like the mock scheduler does), but
trigger that warning. Maybe there are not, and maybe the mock scheduler
is the only false positive.

Regards,

Tvrtko

>> Mock scheduler also tracks the pending jobs itself, but different
>> from
>> nouveau it does not wait for jobs to finish and free worker to
>> process
>> them all, but having stopped the "hw" backend it cancels them and
>> calls
>> the free_job vfunc directly.
>>
>> Going back to the topic of this series, if we go with a solution
>> along
>> the lines of the proposed, I wonder if it would be doable without
>> mandating that drivers keep a list parallel to pending_list. Instead
>> have a vfunc DRM scheduler would call to cancel job at a time from
>> *its*
>> pending list. It would go nicely together with
>> prepare/run/timedout/free.
>>
>> Would it allow getting rid of the new state machinery and just
>> cancelling and freeing in one go directly from drm_sched_fini()?
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> Also, I asked in my initial reply if we have a list of which of
>>>> the current
>>>> drivers suffer from memory leaks. Is it all or some etc.
>>>
>>> Not all, but quite some I think. The last time I looked (which is
>>> about a year
>>> ago) amdgpu for instance could leak memory when you unbind the
>>> driver while
>>> enough jobs are in flight.
>>
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 10:39                 ` Tvrtko Ursulin
@ 2025-04-22 11:13                   ` Danilo Krummrich
  2025-04-22 12:00                     ` Philipp Stanner
  2025-04-22 12:07                     ` Tvrtko Ursulin
  0 siblings, 2 replies; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-22 11:13 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> Question I raised is if there are other drivers which manage to clean up
> everything correctly (like the mock scheduler does), but trigger that
> warning. Maybe there are not and maybe mock scheduler is the only false
> positive.

So far the scheduler simply does not give any guideline on how to address the
problem, hence every driver simply does something (or nothing, effectively
ignoring the problem). This is what we want to fix.

The mock scheduler keeps its own list of pending jobs and on teardown stops
the scheduler's workqueues, traverses its own list and eventually frees the
pending jobs without updating the scheduler's internal pending list.

So yes, it does avoid memory leaks, but it also leaves the scheduler's internal
structures in an invalid state, i.e. the pending list of the scheduler has
pointers to already freed memory.

What if drm_sched_fini() starts touching the pending list? Then you'd end up
with UAF bugs with this implementation. We cannot invalidate the scheduler's
internal structures and yet call scheduler functions (e.g. drm_sched_fini())
subsequently.

Hence, the current implementation of the mock scheduler is fundamentally flawed.
And so would be *every* driver that still has entries within the scheduler's
pending list.

This is not a false positive, it already caught a real bug -- in the mock
scheduler.

- Danilo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 11:13                   ` Danilo Krummrich
@ 2025-04-22 12:00                     ` Philipp Stanner
  2025-04-22 13:25                       ` Tvrtko Ursulin
  2025-04-22 12:07                     ` Tvrtko Ursulin
  1 sibling, 1 reply; 31+ messages in thread
From: Philipp Stanner @ 2025-04-22 12:00 UTC (permalink / raw)
  To: Danilo Krummrich, Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, 2025-04-22 at 13:13 +0200, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > Question I raised is if there are other drivers which manage to
> > clean up
> > everything correctly (like the mock scheduler does), but trigger
> > that
> > warning. Maybe there are not and maybe mock scheduler is the only
> > false
> > positive.

For clarification:

I messed up the comment from the cover letter.

What I did was run the *old* unit tests (v5 IIRC) from Tvrtko that
still had the memory leaks. Those then trigger the warning print, as is
expected, since they don't provide fence_context_kill().

The current unit tests are fine memory-leak-wise.

IOW, both with Nouveau and the unit tests, everything behaves as
expected, without issues.

Sorry for the confusion.

P.

> 
> So far the scheduler simply does not give any guideline on how to
> address the
> problem, hence every driver simply does something (or nothing,
> effectively
> ignoring the problem). This is what we want to fix.
> 
> The mock scheduler keeps it's own list of pending jobs and on tear
> down stops
> the scheduler's workqueues, traverses it's own list and eventually
> frees the
> pending jobs without updating the scheduler's internal pending list.
> 
> So yes, it does avoid memory leaks, but it also leaves the schedulers
> internal
> structures with an invalid state, i.e. the pending list of the
> scheduler has
> pointers to already freed memory.
> 
> What if the drm_sched_fini() starts touching the pending list? Then
> you'd end up
> with UAF bugs with this implementation. We cannot invalidate the
> schedulers
> internal structures and yet call scheduler functions - e.g.
> drm_sched_fini() -
> subsequently.
> 
> Hence, the current implementation of the mock scheduler is
> fundamentally flawed.
> And so would be *every* driver that still has entries within the
> scheduler's
> pending list.
> 
> This is not a false positive, it already caught a real bug -- in the
> mock
> scheduler.
> 
> - Danilo


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 11:13                   ` Danilo Krummrich
  2025-04-22 12:00                     ` Philipp Stanner
@ 2025-04-22 12:07                     ` Tvrtko Ursulin
  2025-04-22 12:21                       ` Philipp Stanner
  2025-04-22 12:32                       ` Danilo Krummrich
  1 sibling, 2 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-22 12:07 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 22/04/2025 12:13, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
>> Question I raised is if there are other drivers which manage to clean up
>> everything correctly (like the mock scheduler does), but trigger that
>> warning. Maybe there are not and maybe mock scheduler is the only false
>> positive.
> 
> So far the scheduler simply does not give any guideline on how to address the
> problem, hence every driver simply does something (or nothing, effectively
> ignoring the problem). This is what we want to fix.
> 
> The mock scheduler keeps it's own list of pending jobs and on tear down stops
> the scheduler's workqueues, traverses it's own list and eventually frees the
> pending jobs without updating the scheduler's internal pending list.
> 
> So yes, it does avoid memory leaks, but it also leaves the schedulers internal
> structures with an invalid state, i.e. the pending list of the scheduler has
> pointers to already freed memory.
> 
> What if the drm_sched_fini() starts touching the pending list? Then you'd end up
> with UAF bugs with this implementation. We cannot invalidate the schedulers
> internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
> subsequently.
> 
> Hence, the current implementation of the mock scheduler is fundamentally flawed.
> And so would be *every* driver that still has entries within the scheduler's
> pending list.
> 
> This is not a false positive, it already caught a real bug -- in the mock
> scheduler.

To avoid further splitting hairs on whether real bugs need to be able
to manifest or not, let's move past this with the conclusion that there
are two potential things to do here:

The first one is to either send separately, or include in this series,
something like:

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index f999c8859cf7..7c4df0e890ac 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
 		drm_mock_sched_job_complete(job);
 	spin_unlock_irqrestore(&sched->lock, flags);
 
+	drm_sched_fini(&sched->base);
+
 	/*
 	 * Free completed jobs and jobs not yet processed by the DRM scheduler
 	 * free worker.
@@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
 
 	list_for_each_entry_safe(job, next, &list, link)
 		mock_sched_free_job(&job->base);
-
-	drm_sched_fini(&sched->base);
 }
 
 /**

That should satisfy the requirement to "clear" memory about to be freed
and be 100% compliant with the drm_sched_fini() kerneldoc (guideline b).

But the new warning from 3/5 will still trigger AFAICT, and would you
then agree it is a false positive?

Secondly, the series should modify all drivers (including the unit 
tests) which are known to trigger this false positive.

Regards,

Tvrtko


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 12:07                     ` Tvrtko Ursulin
@ 2025-04-22 12:21                       ` Philipp Stanner
  2025-04-22 12:32                       ` Danilo Krummrich
  1 sibling, 0 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-22 12:21 UTC (permalink / raw)
  To: Tvrtko Ursulin, Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, 2025-04-22 at 13:07 +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > Question I raised is if there are other drivers which manage to
> > > clean up
> > > everything correctly (like the mock scheduler does), but trigger
> > > that
> > > warning. Maybe there are not and maybe mock scheduler is the only
> > > false
> > > positive.
> > 
> > So far the scheduler simply does not give any guideline on how to
> > address the
> > problem, hence every driver simply does something (or nothing,
> > effectively
> > ignoring the problem). This is what we want to fix.
> > 
> > The mock scheduler keeps it's own list of pending jobs and on tear
> > down stops
> > the scheduler's workqueues, traverses it's own list and eventually
> > frees the
> > pending jobs without updating the scheduler's internal pending
> > list.
> > 
> > So yes, it does avoid memory leaks, but it also leaves the
> > schedulers internal
> > structures with an invalid state, i.e. the pending list of the
> > scheduler has
> > pointers to already freed memory.
> > 
> > What if the drm_sched_fini() starts touching the pending list? Then
> > you'd end up
> > with UAF bugs with this implementation. We cannot invalidate the
> > schedulers
> > internal structures and yet call scheduler functions - e.g.
> > drm_sched_fini() -
> > subsequently.
> > 
> > Hence, the current implementation of the mock scheduler is
> > fundamentally flawed.
> > And so would be *every* driver that still has entries within the
> > scheduler's
> > pending list.
> > 
> > This is not a false positive, it already caught a real bug -- in
> > the mock
> > scheduler.
> 
> To avoid furher splitting hairs on whether real bugs need to be able
> to 
> manifest or not, lets move past this with a conclusion that there are
> two potential things to do here:
> 
> First one is to either send separately or include in this series 
> something like:
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c 
> b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct
> drm_mock_scheduler 
> *sched)
>                  drm_mock_sched_job_complete(job);
>          spin_unlock_irqrestore(&sched->lock, flags);
> 
> +       drm_sched_fini(&sched->base);
> +
>          /*
>           * Free completed jobs and jobs not yet processed by the DRM
> scheduler
>           * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct
> drm_mock_scheduler 
> *sched)
> 
>          list_for_each_entry_safe(job, next, &list, link)
>                  mock_sched_free_job(&job->base);
> -
> -       drm_sched_fini(&sched->base);
>   }
> 
>   /**
> 
> That should satisfy the requirement to "clear" memory about to be
> freed 
> and be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> 
> But the new warning from 3/5 here will still be there AFAICT and
> would 
> you then agree it is a false positive?
> 
> Secondly, the series should modify all drivers (including the unit 
> tests) which are known to trigger this false positive.

There is no false positive. I just stated that in my other answer.

That said, I agree that the unit tests should be a second user of this
code in addition to Nouveau. They then should also provide the
fence_context_kill() callback, though.

And it seems we also agree that the memory leaks must be solved
centrally in drm_sched_fini().


P.

> 
> Regards,
> 
> Tvrtko
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 12:07                     ` Tvrtko Ursulin
  2025-04-22 12:21                       ` Philipp Stanner
@ 2025-04-22 12:32                       ` Danilo Krummrich
  2025-04-22 13:39                         ` Tvrtko Ursulin
  1 sibling, 1 reply; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-22 12:32 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > Question I raised is if there are other drivers which manage to clean up
> > > everything correctly (like the mock scheduler does), but trigger that
> > > warning. Maybe there are not and maybe mock scheduler is the only false
> > > positive.
> > 
> > So far the scheduler simply does not give any guideline on how to address the
> > problem, hence every driver simply does something (or nothing, effectively
> > ignoring the problem). This is what we want to fix.
> > 
> > The mock scheduler keeps it's own list of pending jobs and on tear down stops
> > the scheduler's workqueues, traverses it's own list and eventually frees the
> > pending jobs without updating the scheduler's internal pending list.
> > 
> > So yes, it does avoid memory leaks, but it also leaves the schedulers internal
> > structures with an invalid state, i.e. the pending list of the scheduler has
> > pointers to already freed memory.
> > 
> > What if the drm_sched_fini() starts touching the pending list? Then you'd end up
> > with UAF bugs with this implementation. We cannot invalidate the schedulers
> > internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
> > subsequently.
> > 
> > Hence, the current implementation of the mock scheduler is fundamentally flawed.
> > And so would be *every* driver that still has entries within the scheduler's
> > pending list.
> > 
> > This is not a false positive, it already caught a real bug -- in the mock
> > scheduler.
> 
> To avoid furher splitting hairs on whether real bugs need to be able to
> manifest or not, lets move past this with a conclusion that there are two
> potential things to do here:

This is not about splitting hairs, it is about understanding that abusing
knowledge about internals of a component to clean things up is *never* valid.

> First one is to either send separately or include in this series something
> like:
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>                 drm_mock_sched_job_complete(job);
>         spin_unlock_irqrestore(&sched->lock, flags);
> 
> +       drm_sched_fini(&sched->base);
> +
>         /*
>          * Free completed jobs and jobs not yet processed by the DRM scheduler
>          * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> 
>         list_for_each_entry_safe(job, next, &list, link)
>                 mock_sched_free_job(&job->base);
> -
> -       drm_sched_fini(&sched->base);
>  }
> 
>  /**
> 
> That should satisfy the requirement to "clear" memory about to be freed and
> be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> 
> But the new warning from 3/5 here will still be there AFAICT and would you
> then agree it is a false positive?

No, I do not agree.

Even if a driver does what you describe, it is not the correct thing to do, and
having a warning call it out makes sense.

This way of cleaning things up relies entirely on knowing specific scheduler
internals; if those change, it may fall apart.

> Secondly, the series should modify all drivers (including the unit tests)
> which are known to trigger this false positive.

Again, there are no false positives. It is the scheduler that needs to call
free_job() and perform other potential cleanups. You can't just stop the
scheduler, leave it in an intermediate state and try to clean it up by hand,
relying on knowledge about internals.

Consequently, when the pending list isn't empty when drm_sched_fini() is called,
it is *always* a bug.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 12:00                     ` Philipp Stanner
@ 2025-04-22 13:25                       ` Tvrtko Ursulin
  0 siblings, 0 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-22 13:25 UTC (permalink / raw)
  To: phasta, Danilo Krummrich
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 22/04/2025 13:00, Philipp Stanner wrote:
> On Tue, 2025-04-22 at 13:13 +0200, Danilo Krummrich wrote:
>> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
>>> Question I raised is if there are other drivers which manage to
>>> clean up
>>> everything correctly (like the mock scheduler does), but trigger
>>> that
>>> warning. Maybe there are not and maybe mock scheduler is the only
>>> false
>>> positive.
> 
> For clarification:
> 
> I messed up the comment from the cover letter.
> 
> What I did was run the *old* unit tests (v5 IIRC) from Tvrtko that
> still had the memory leaks. Those then trigger the warning print, as is
> expected, since they don't provide fence_context_kill().
> 
> The current unit tests are fine memory-leak-wise.
> 
> IOW, both with Nouveau and the unit tests, everything behaves as
> expected, without issues.

Hmm, how? The pending list can be non-empty, so it should be possible to hit 
that warning.

Regards,

Tvrtko

>> So far the scheduler simply does not give any guideline on how to
>> address the
>> problem, hence every driver simply does something (or nothing,
>> effectively
>> ignoring the problem). This is what we want to fix.
>>
>> The mock scheduler keeps it's own list of pending jobs and on tear
>> down stops
>> the scheduler's workqueues, traverses it's own list and eventually
>> frees the
>> pending jobs without updating the scheduler's internal pending list.
>>
>> So yes, it does avoid memory leaks, but it also leaves the schedulers
>> internal
>> structures with an invalid state, i.e. the pending list of the
>> scheduler has
>> pointers to already freed memory.
>>
>> What if the drm_sched_fini() starts touching the pending list? Then
>> you'd end up
>> with UAF bugs with this implementation. We cannot invalidate the
>> schedulers
>> internal structures and yet call scheduler functions - e.g.
>> drm_sched_fini() -
>> subsequently.
>>
>> Hence, the current implementation of the mock scheduler is
>> fundamentally flawed.
>> And so would be *every* driver that still has entries within the
>> scheduler's
>> pending list.
>>
>> This is not a false positive, it already caught a real bug -- in the
>> mock
>> scheduler.
>>
>> - Danilo
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 12:32                       ` Danilo Krummrich
@ 2025-04-22 13:39                         ` Tvrtko Ursulin
  2025-04-22 13:46                           ` Philipp Stanner
  2025-04-22 14:08                           ` Danilo Krummrich
  0 siblings, 2 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-22 13:39 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 22/04/2025 13:32, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
>>
>> On 22/04/2025 12:13, Danilo Krummrich wrote:
>>> On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
>>>> Question I raised is if there are other drivers which manage to clean up
>>>> everything correctly (like the mock scheduler does), but trigger that
>>>> warning. Maybe there are not and maybe mock scheduler is the only false
>>>> positive.
>>>
>>> So far the scheduler simply does not give any guideline on how to address the
>>> problem, hence every driver simply does something (or nothing, effectively
>>> ignoring the problem). This is what we want to fix.
>>>
>>> The mock scheduler keeps it's own list of pending jobs and on tear down stops
>>> the scheduler's workqueues, traverses it's own list and eventually frees the
>>> pending jobs without updating the scheduler's internal pending list.
>>>
>>> So yes, it does avoid memory leaks, but it also leaves the schedulers internal
>>> structures with an invalid state, i.e. the pending list of the scheduler has
>>> pointers to already freed memory.
>>>
>>> What if the drm_sched_fini() starts touching the pending list? Then you'd end up
>>> with UAF bugs with this implementation. We cannot invalidate the schedulers
>>> internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
>>> subsequently.
>>>
>>> Hence, the current implementation of the mock scheduler is fundamentally flawed.
>>> And so would be *every* driver that still has entries within the scheduler's
>>> pending list.
>>>
>>> This is not a false positive, it already caught a real bug -- in the mock
>>> scheduler.
>>
>> To avoid furher splitting hairs on whether real bugs need to be able to
>> manifest or not, lets move past this with a conclusion that there are two
>> potential things to do here:
> 
> This is not about splitting hairs, it is about understanding that abusing
> knowledge about internals of a component to clean things up is *never* valid.
> 
>> First one is to either send separately or include in this series something
>> like:
>>
>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> index f999c8859cf7..7c4df0e890ac 100644
>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>>                  drm_mock_sched_job_complete(job);
>>          spin_unlock_irqrestore(&sched->lock, flags);
>>
>> +       drm_sched_fini(&sched->base);
>> +
>>          /*
>>           * Free completed jobs and jobs not yet processed by the DRM scheduler
>>           * free worker.
>> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>>
>>          list_for_each_entry_safe(job, next, &list, link)
>>                  mock_sched_free_job(&job->base);
>> -
>> -       drm_sched_fini(&sched->base);
>>   }
>>
>>   /**
>>
>> That should satisfy the requirement to "clear" memory about to be freed and
>> be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
>>
>> But the new warning from 3/5 here will still be there AFAICT and would you
>> then agree it is a false positive?
> 
> No, I do not agree.
> 
> Even if a driver does what you describe it is not the correct thing to do and
> having a warning call it out makes sense.
> 
> This way of cleaning things up entirely relies on knowing specific scheduler
> internals, which if changed, may fall apart.
> 
>> Secondly, the series should modify all drivers (including the unit tests)
>> which are known to trigger this false positive.
> 
> Again, there are no false positives. It is the scheduler that needs to call
> free_job() and other potential cleanups. You can't just stop the scheduler,
> leave it in an intermediate state and try to clean it up by hand relying on
> knowledge about internals.

Sorry, I don't see the argument for the claim that this relies on the 
internals once the drm_sched_fini() call is re-positioned. In that case it is 
fully compliant with:

/**
  * drm_sched_fini - Destroy a gpu scheduler
  *
  * @sched: scheduler instance
  *
  * Tears down and cleans up the scheduler.
  *
  * This stops submission of new jobs to the hardware through
  * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
  * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
  * There is no solution for this currently. Thus, it is up to the driver to make
  * sure that:
  *
  *  a) drm_sched_fini() is only called after for all submitted jobs
  *     drm_sched_backend_ops.free_job() has been called or that
  *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
  *     after drm_sched_fini() ran are freed manually.
  *
  * FIXME: Take care of the above problem and prevent this function from leaking
  * the jobs in drm_gpu_scheduler.pending_list under any circumstances.

^^^ recommended solution b).

> Consequently, when the pending list isn't empty when drm_sched_fini() is called,
> it *always* is a bug.

I am simply arguing that a quick audit of the drivers should be done to 
ensure that the dev_err is not added for drivers which clean up in 
compliance with the drm_sched_fini() kerneldoc.

Starting to log errors from those would be adding work for many people 
in the bug handling chain. Sure, you can say let's add the dev_err and 
then we don't have to look into the code base, just wait for bug reports 
to come our way. That works too (on some level), but let's please state 
the intent clearly and explicitly.

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 13:39                         ` Tvrtko Ursulin
@ 2025-04-22 13:46                           ` Philipp Stanner
  2025-04-22 14:08                           ` Danilo Krummrich
  1 sibling, 0 replies; 31+ messages in thread
From: Philipp Stanner @ 2025-04-22 13:46 UTC (permalink / raw)
  To: Tvrtko Ursulin, Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, 2025-04-22 at 14:39 +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 13:32, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 22/04/2025 12:13, Danilo Krummrich wrote:
> > > > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > > > Question I raised is if there are other drivers which manage
> > > > > to clean up
> > > > > everything correctly (like the mock scheduler does), but
> > > > > trigger that
> > > > > warning. Maybe there are not and maybe mock scheduler is the
> > > > > only false
> > > > > positive.
> > > > 
> > > > So far the scheduler simply does not give any guideline on how
> > > > to address the
> > > > problem, hence every driver simply does something (or nothing,
> > > > effectively
> > > > ignoring the problem). This is what we want to fix.
> > > > 
> > > > The mock scheduler keeps it's own list of pending jobs and on
> > > > tear down stops
> > > > the scheduler's workqueues, traverses it's own list and
> > > > eventually frees the
> > > > pending jobs without updating the scheduler's internal pending
> > > > list.
> > > > 
> > > > So yes, it does avoid memory leaks, but it also leaves the
> > > > schedulers internal
> > > > structures with an invalid state, i.e. the pending list of the
> > > > scheduler has
> > > > pointers to already freed memory.
> > > > 
> > > > What if the drm_sched_fini() starts touching the pending list?
> > > > Then you'd end up
> > > > with UAF bugs with this implementation. We cannot invalidate
> > > > the schedulers
> > > > internal structures and yet call scheduler functions - e.g.
> > > > drm_sched_fini() -
> > > > subsequently.
> > > > 
> > > > Hence, the current implementation of the mock scheduler is
> > > > fundamentally flawed.
> > > > And so would be *every* driver that still has entries within
> > > > the scheduler's
> > > > pending list.
> > > > 
> > > > This is not a false positive, it already caught a real bug --
> > > > in the mock
> > > > scheduler.
> > > 
> > > To avoid furher splitting hairs on whether real bugs need to be
> > > able to
> > > manifest or not, lets move past this with a conclusion that there
> > > are two
> > > potential things to do here:
> > 
> > This is not about splitting hairs, it is about understanding that
> > abusing
> > knowledge about internals of a component to clean things up is
> > *never* valid.
> > 
> > > First one is to either send separately or include in this series
> > > something
> > > like:
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > index f999c8859cf7..7c4df0e890ac 100644
> > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct
> > > drm_mock_scheduler
> > > *sched)
> > >                  drm_mock_sched_job_complete(job);
> > >          spin_unlock_irqrestore(&sched->lock, flags);
> > > 
> > > +       drm_sched_fini(&sched->base);
> > > +
> > >          /*
> > >           * Free completed jobs and jobs not yet processed by the
> > > DRM
> > > scheduler
> > >           * free worker.
> > > @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct
> > > drm_mock_scheduler
> > > *sched)
> > > 
> > >          list_for_each_entry_safe(job, next, &list, link)
> > >                  mock_sched_free_job(&job->base);
> > > -
> > > -       drm_sched_fini(&sched->base);
> > >   }
> > > 
> > >   /**
> > > 
> > > That should satisfy the requirement to "clear" memory about to be
> > > freed and
> > > be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> > > 
> > > But the new warning from 3/5 here will still be there AFAICT and
> > > would you
> > > then agree it is a false positive?
> > 
> > No, I do not agree.
> > 
> > Even if a driver does what you describe it is not the correct thing
> > to do and
> > having a warning call it out makes sense.
> > 
> > This way of cleaning things up entirely relies on knowing specific
> > scheduler
> > internals, which if changed, may fall apart.
> > 
> > > Secondly, the series should modify all drivers (including the
> > > unit tests)
> > > which are known to trigger this false positive.
> > 
> > Again, there are no false positives. It is the scheduler that needs
> > to call
> > free_job() and other potential cleanups. You can't just stop the
> > scheduler,
> > leave it in an intermediate state and try to clean it up by hand
> > relying on
> > knowledge about internals.
> 
> Sorry I don't see the argument for the claim it is relying on the 
> internals with the re-positioned drm_sched_fini call. In that case it
> is 
> fully compliant with:
> 
> /**
>   * drm_sched_fini - Destroy a gpu scheduler
>   *
>   * @sched: scheduler instance
>   *
>   * Tears down and cleans up the scheduler.
>   *
>   * This stops submission of new jobs to the hardware through
>   * drm_sched_backend_ops.run_job(). Consequently, 
> drm_sched_backend_ops.free_job()
>   * will not be called for all jobs still in
> drm_gpu_scheduler.pending_list.
>   * There is no solution for this currently. Thus, it is up to the 
> driver to make
>   * sure that:
>   *
>   *  a) drm_sched_fini() is only called after for all submitted jobs
>   *     drm_sched_backend_ops.free_job() has been called or that
>   *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> been 
> called
>   *
>   * FIXME: Take care of the above problem and prevent this function
> from 
> leaking
>   * the jobs in drm_gpu_scheduler.pending_list under any
> circumstances.
> 
> ^^^ recommended solution b).
> 
> > Consequently, when the pending list isn't empty when
> > drm_sched_fini() is called,
> > it *always* is a bug.
> 
> I am simply arguing that a quick audit of the drivers should be done
> to 
> see that the dev_err is not added for drivers which clean up in 
> compliance with drm_sched_fini() kerneldoc.
> 
> Starting to log errors from those would be adding work for many
> people 
> in the bug handling chain. Sure you can say lets add the dev_err and 
> then we don't have to look into the code base, just wait for bug
> reports 
> to come our way. That works too (on some level) but lets please state
> the intent clearly and explicitly.

Well, yes, that exactly is my intention.

All drivers must currently ensure in some custom way that a) all
fences get signaled and b) the scheduler had time to call
free_job() for all jobs in pending_list.

If any in-tree driver currently ends up with len(pending_list) > 0
after drm_sched_fini() ran, those are clearly memory leaks that need to
be fixed.

And, thus, firing the warning for all those drivers is appropriate.

I think it's unlikely to happen though. The hardware schedulers rarely
call drm_sched_fini(), and the firmware schedulers would have memory
leaks so large that they are likely to have been discovered by now.
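
For reference, the check under discussion amounts to something like the
following at the end of drm_sched_fini() -- the exact message and log level
here are assumptions; whatever patch 3/5 ends up printing is authoritative:

	/* Sketch only: complain if jobs are still pending, i.e. leaked. */
	if (!list_empty(&sched->pending_list))
		dev_err(sched->dev,
			"%s: scheduler torn down with jobs still pending, leaking them!\n",
			__func__);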

P.

> 
> Regards,
> 
> Tvrtko
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 13:39                         ` Tvrtko Ursulin
  2025-04-22 13:46                           ` Philipp Stanner
@ 2025-04-22 14:08                           ` Danilo Krummrich
  2025-04-22 14:16                             ` Philipp Stanner
  1 sibling, 1 reply; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-22 14:08 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> 
> On 22/04/2025 13:32, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 22/04/2025 12:13, Danilo Krummrich wrote:
> > > > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > > > Question I raised is if there are other drivers which manage to clean up
> > > > > everything correctly (like the mock scheduler does), but trigger that
> > > > > warning. Maybe there are not and maybe mock scheduler is the only false
> > > > > positive.
> > > > 
> > > > So far the scheduler simply does not give any guideline on how to address the
> > > > problem, hence every driver simply does something (or nothing, effectively
> > > > ignoring the problem). This is what we want to fix.
> > > > 
> > > > The mock scheduler keeps it's own list of pending jobs and on tear down stops
> > > > the scheduler's workqueues, traverses it's own list and eventually frees the
> > > > pending jobs without updating the scheduler's internal pending list.
> > > > 
> > > > So yes, it does avoid memory leaks, but it also leaves the schedulers internal
> > > > structures with an invalid state, i.e. the pending list of the scheduler has
> > > > pointers to already freed memory.
> > > > 
> > > > What if the drm_sched_fini() starts touching the pending list? Then you'd end up
> > > > with UAF bugs with this implementation. We cannot invalidate the schedulers
> > > > internal structures and yet call scheduler functions - e.g. drm_sched_fini() -
> > > > subsequently.
> > > > 
> > > > Hence, the current implementation of the mock scheduler is fundamentally flawed.
> > > > And so would be *every* driver that still has entries within the scheduler's
> > > > pending list.
> > > > 
> > > > This is not a false positive, it already caught a real bug -- in the mock
> > > > scheduler.
> > > 
> > > To avoid furher splitting hairs on whether real bugs need to be able to
> > > manifest or not, lets move past this with a conclusion that there are two
> > > potential things to do here:
> > 
> > This is not about splitting hairs, it is about understanding that abusing
> > knowledge about internals of a component to clean things up is *never* valid.
> > 
> > > First one is to either send separately or include in this series something
> > > like:
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > index f999c8859cf7..7c4df0e890ac 100644
> > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
> > > *sched)
> > >                  drm_mock_sched_job_complete(job);
> > >          spin_unlock_irqrestore(&sched->lock, flags);
> > > 
> > > +       drm_sched_fini(&sched->base);
> > > +
> > >          /*
> > >           * Free completed jobs and jobs not yet processed by the DRM
> > > scheduler
> > >           * free worker.
> > > @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler
> > > *sched)
> > > 
> > >          list_for_each_entry_safe(job, next, &list, link)
> > >                  mock_sched_free_job(&job->base);
> > > -
> > > -       drm_sched_fini(&sched->base);
> > >   }
> > > 
> > >   /**
> > > 
> > > That should satisfy the requirement to "clear" memory about to be freed and
> > > be 100% compliant with drm_sched_fini() kerneldoc (guideline b).
> > > 
> > > But the new warning from 3/5 here will still be there AFAICT and would you
> > > then agree it is a false positive?
> > 
> > No, I do not agree.
> > 
> > Even if a driver does what you describe it is not the correct thing to do and
> > having a warning call it out makes sense.
> > 
> > This way of cleaning things up entirely relies on knowing specific scheduler
> > internals, which if changed, may fall apart.
> > 
> > > Secondly, the series should modify all drivers (including the unit tests)
> > > which are known to trigger this false positive.
> > 
> > Again, there are no false positives. It is the scheduler that needs to call
> > free_job() and other potential cleanups. You can't just stop the scheduler,
> > leave it in an intermediate state and try to clean it up by hand relying on
> > knowledge about internals.
> 
> Sorry I don't see the argument for the claim it is relying on the internals
> with the re-positioned drm_sched_fini call. In that case it is fully
> compliant with:
> 
> /**
>  * drm_sched_fini - Destroy a gpu scheduler
>  *
>  * @sched: scheduler instance
>  *
>  * Tears down and cleans up the scheduler.
>  *
>  * This stops submission of new jobs to the hardware through
>  * drm_sched_backend_ops.run_job(). Consequently,
> drm_sched_backend_ops.free_job()
>  * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
>  * There is no solution for this currently. Thus, it is up to the driver to
> make
>  * sure that:
>  *
>  *  a) drm_sched_fini() is only called after for all submitted jobs
>  *     drm_sched_backend_ops.free_job() has been called or that
>  *  b) the jobs for which drm_sched_backend_ops.free_job() has not been
> called
>  *
>  * FIXME: Take care of the above problem and prevent this function from
> leaking
>  * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> 
> ^^^ recommended solution b).

This has been introduced recently with commit baf4afc58314 ("drm/sched: Improve
teardown documentation") and I do not agree with this. The scheduler should
*not* make any promises about implementation details to enable drivers to abuse
their knowledge about component internals.

This makes the problem *worse* as it encourages drivers to rely on
implementation details, making maintainability of the scheduler even worse.

For instance, what if I change the scheduler implementation, such that for every
entry in the pending_list the scheduler allocates another internal object for
${something}? Then drivers would already fall apart leaking those internal
objects.

Now, obviously that's pretty unlikely, but I assume you get the idea.

The b) paragraph in the drm_sched_fini() documentation should be removed for the
given reasons.

AFAICS, since the introduction of this commit, driver implementations haven't
changed in this regard, hence we should be good.

So, for me this doesn't change the fact that every driver implementation that
just stops the scheduler at an arbitrary point of time and tries to clean things
up manually relying on knowledge about component internals is broken.

However, this doesn't mean we can't do a brief audit.

- Danilo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 14:08                           ` Danilo Krummrich
@ 2025-04-22 14:16                             ` Philipp Stanner
  2025-04-22 14:52                               ` Danilo Krummrich
  0 siblings, 1 reply; 31+ messages in thread
From: Philipp Stanner @ 2025-04-22 14:16 UTC (permalink / raw)
  To: Danilo Krummrich, Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> > 
> > On 22/04/2025 13:32, Danilo Krummrich wrote:
> > > On Tue, Apr 22, 2025 at 01:07:47PM +0100, Tvrtko Ursulin wrote:
> > > > 
> > > > On 22/04/2025 12:13, Danilo Krummrich wrote:
> > > > > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin
> > > > > wrote:
> > > > > > Question I raised is if there are other drivers which
> > > > > > manage to clean up
> > > > > > everything correctly (like the mock scheduler does), but
> > > > > > trigger that
> > > > > > warning. Maybe there are not and maybe mock scheduler is
> > > > > > the only false
> > > > > > positive.
> > > > > 
> > > > > So far the scheduler simply does not give any guideline on
> > > > > how to address the
> > > > > problem, hence every driver simply does something (or
> > > > > nothing, effectively
> > > > > ignoring the problem). This is what we want to fix.
> > > > > 
> > > > > The mock scheduler keeps it's own list of pending jobs and on
> > > > > tear down stops
> > > > > the scheduler's workqueues, traverses it's own list and
> > > > > eventually frees the
> > > > > pending jobs without updating the scheduler's internal
> > > > > pending list.
> > > > > 
> > > > > So yes, it does avoid memory leaks, but it also leaves the
> > > > > schedulers internal
> > > > > structures with an invalid state, i.e. the pending list of
> > > > > the scheduler has
> > > > > pointers to already freed memory.
> > > > > 
> > > > > What if the drm_sched_fini() starts touching the pending
> > > > > list? Then you'd end up
> > > > > with UAF bugs with this implementation. We cannot invalidate
> > > > > the schedulers
> > > > > internal structures and yet call scheduler functions - e.g.
> > > > > drm_sched_fini() -
> > > > > subsequently.
> > > > > 
> > > > > Hence, the current implementation of the mock scheduler is
> > > > > fundamentally flawed.
> > > > > And so would be *every* driver that still has entries within
> > > > > the scheduler's
> > > > > pending list.
> > > > > 
> > > > > This is not a false positive, it already caught a real bug --
> > > > > in the mock
> > > > > scheduler.
> > > > 
> > > > To avoid furher splitting hairs on whether real bugs need to be
> > > > able to
> > > > manifest or not, lets move past this with a conclusion that
> > > > there are two
> > > > potential things to do here:
> > > 
> > > This is not about splitting hairs, it is about understanding that
> > > abusing
> > > knowledge about internals of a component to clean things up is
> > > *never* valid.
> > > 
> > > > First one is to either send separately or include in this
> > > > series something
> > > > like:
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > index f999c8859cf7..7c4df0e890ac 100644
> > > > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > > > @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct
> > > > drm_mock_scheduler
> > > > *sched)
> > > >                  drm_mock_sched_job_complete(job);
> > > >          spin_unlock_irqrestore(&sched->lock, flags);
> > > > 
> > > > +       drm_sched_fini(&sched->base);
> > > > +
> > > >          /*
> > > >           * Free completed jobs and jobs not yet processed by
> > > > the DRM
> > > > scheduler
> > > >           * free worker.
> > > > @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct
> > > > drm_mock_scheduler
> > > > *sched)
> > > > 
> > > >          list_for_each_entry_safe(job, next, &list, link)
> > > >                  mock_sched_free_job(&job->base);
> > > > -
> > > > -       drm_sched_fini(&sched->base);
> > > >   }
> > > > 
> > > >   /**
> > > > 
> > > > That should satisfy the requirement to "clear" memory about to
> > > > be freed and
> > > > be 100% compliant with drm_sched_fini() kerneldoc (guideline
> > > > b).
> > > > 
> > > > But the new warning from 3/5 here will still be there AFAICT
> > > > and would you
> > > > then agree it is a false positive?
> > > 
> > > No, I do not agree.
> > > 
> > > Even if a driver does what you describe it is not the correct
> > > thing to do and
> > > having a warning call it out makes sense.
> > > 
> > > This way of cleaning things up entirely relies on knowing
> > > specific scheduler
> > > internals, which if changed, may fall apart.
> > > 
> > > > Secondly, the series should modify all drivers (including the
> > > > unit tests)
> > > > which are known to trigger this false positive.
> > > 
> > > Again, there are no false positives. It is the scheduler that
> > > needs to call
> > > free_job() and other potential cleanups. You can't just stop the
> > > scheduler,
> > > leave it in an intermediate state and try to clean it up by hand
> > > relying on
> > > knowledge about internals.
> > 
> > Sorry I don't see the argument for the claim it is relying on the
> > internals
> > with the re-positioned drm_sched_fini call. In that case it is
> > fully
> > compliant with:
> > 
> > /**
> >  * drm_sched_fini - Destroy a gpu scheduler
> >  *
> >  * @sched: scheduler instance
> >  *
> >  * Tears down and cleans up the scheduler.
> >  *
> >  * This stops submission of new jobs to the hardware through
> >  * drm_sched_backend_ops.run_job(). Consequently,
> > drm_sched_backend_ops.free_job()
> >  * will not be called for all jobs still in
> > drm_gpu_scheduler.pending_list.
> >  * There is no solution for this currently. Thus, it is up to the
> > driver to
> > make
> >  * sure that:
> >  *
> >  *  a) drm_sched_fini() is only called after for all submitted jobs
> >  *     drm_sched_backend_ops.free_job() has been called or that
> >  *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> > been
> > called
> >  *
> >  * FIXME: Take care of the above problem and prevent this function
> > from
> > leaking
> >  * the jobs in drm_gpu_scheduler.pending_list under any
> > circumstances.
> > 
> > ^^^ recommended solution b).
> 
> This has been introduced recently with commit baf4afc58314
> ("drm/sched: Improve
> teardown documentation") and I do not agree with this. The scheduler
> should
> *not* make any promises about implementation details to enable
> drivers to abuse
> their knowledge about component internals.
> 
> This makes the problem *worse* as it encourages drivers to rely on
> implementation details, making maintainability of the scheduler even
> worse.
> 
> For instance, what if I change the scheduler implementation, such
> that for every
> entry in the pending_list the scheduler allocates another internal
> object for
> ${something}? Then drivers would already fall apart leaking those
> internal
> objects.
> 
> Now, obviously that's pretty unlikely, but I assume you get the idea.
> 
> The b) paragraph in drm_sched_fini() should be removed for the given
> reasons.
> 
> AFAICS, since the introduction of this commit, driver implementations
> haven't
> changed in this regard, hence we should be good.
> 
> So, for me this doesn't change the fact that every driver
> implementation that
> just stops the scheduler at an arbitrary point of time and tries to
> clean things
> up manually relying on knowledge about component internals is broken.

To elaborate on that, this documentation has been written so that we at
least have *some* documentation about the problem, instead of just
letting new drivers run into the knife.

The commit explicitly introduced the FIXME, marking those two hacky
workarounds as undesirable.

But back then we couldn't fix the problem quickly, so it was either
document the issue at least a bit, or leave it completely undocumented.

P.

> 
> However, this doesn't mean we can't do a brief audit.
> 
> - Danilo


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 14:16                             ` Philipp Stanner
@ 2025-04-22 14:52                               ` Danilo Krummrich
  2025-04-23  7:34                                 ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-22 14:52 UTC (permalink / raw)
  To: phasta
  Cc: Tvrtko Ursulin, Lyude Paul, David Airlie, Simona Vetter,
	Matthew Brost, Christian König, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, dri-devel, nouveau,
	linux-kernel

On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> 
> > > Sorry I don't see the argument for the claim it is relying on the
> > > internals
> > > with the re-positioned drm_sched_fini call. In that case it is
> > > fully
> > > compliant with:
> > > 
> > > /**
> > >  * drm_sched_fini - Destroy a gpu scheduler
> > >  *
> > >  * @sched: scheduler instance
> > >  *
> > >  * Tears down and cleans up the scheduler.
> > >  *
> > >  * This stops submission of new jobs to the hardware through
> > >  * drm_sched_backend_ops.run_job(). Consequently,
> > > drm_sched_backend_ops.free_job()
> > >  * will not be called for all jobs still in
> > > drm_gpu_scheduler.pending_list.
> > >  * There is no solution for this currently. Thus, it is up to the
> > > driver to
> > > make
> > >  * sure that:
> > >  *
> > >  *  a) drm_sched_fini() is only called after for all submitted jobs
> > >  *     drm_sched_backend_ops.free_job() has been called or that
> > >  *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> > > been
> > > called
> > >  *
> > >  * FIXME: Take care of the above problem and prevent this function
> > > from
> > > leaking
> > >  * the jobs in drm_gpu_scheduler.pending_list under any
> > > circumstances.
> > > 
> > > ^^^ recommended solution b).
> > 
> > This has been introduced recently with commit baf4afc58314
> > ("drm/sched: Improve
> > teardown documentation") and I do not agree with this. The scheduler
> > should
> > *not* make any promises about implementation details to enable
> > drivers to abuse
> > their knowledge about component internals.
> > 
> > This makes the problem *worse* as it encourages drivers to rely on
> > implementation details, making maintainability of the scheduler even
> > worse.
> > 
> > For instance, what if I change the scheduler implementation, such
> > that for every
> > entry in the pending_list the scheduler allocates another internal
> > object for
> > ${something}? Then drivers would already fall apart leaking those
> > internal
> > objects.
> > 
> > Now, obviously that's pretty unlikely, but I assume you get the idea.
> > 
> > The b) paragraph in drm_sched_fini() should be removed for the given
> > reasons.
> > 
> > AFAICS, since the introduction of this commit, driver implementations
> > haven't
> > changed in this regard, hence we should be good.
> > 
> > So, for me this doesn't change the fact that every driver
> > implementation that
> > just stops the scheduler at an arbitrary point of time and tries to
> > clean things
> > up manually relying on knowledge about component internals is broken.
> 
> To elaborate on that, this documentation has been written so that we at
> least have *some* documentation about the problem, instead of just
> letting new drivers run into the knife.
> 
> The commit explicitly introduced the FIXME, marking those two hacky
> workarounds as undesirable.
> 
> But back then we couldn't fix the problem quickly, so it was either
> document the issue at least a bit, or leave it completely undocumented.

Agreed, but b) really sounds like an invitation (or even justification) for
doing the wrong thing; let's remove it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-22 14:52                               ` Danilo Krummrich
@ 2025-04-23  7:34                                 ` Tvrtko Ursulin
  2025-04-23  8:48                                   ` Danilo Krummrich
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-23  7:34 UTC (permalink / raw)
  To: Danilo Krummrich, phasta
  Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 22/04/2025 15:52, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
>> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
>>> On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
>>
>>>> Sorry I don't see the argument for the claim it is relying on the
>>>> internals
>>>> with the re-positioned drm_sched_fini call. In that case it is
>>>> fully
>>>> compliant with:
>>>>
>>>> /**
>>>>   * drm_sched_fini - Destroy a gpu scheduler
>>>>   *
>>>>   * @sched: scheduler instance
>>>>   *
>>>>   * Tears down and cleans up the scheduler.
>>>>   *
>>>>   * This stops submission of new jobs to the hardware through
>>>>   * drm_sched_backend_ops.run_job(). Consequently,
>>>> drm_sched_backend_ops.free_job()
>>>>   * will not be called for all jobs still in
>>>> drm_gpu_scheduler.pending_list.
>>>>   * There is no solution for this currently. Thus, it is up to the
>>>> driver to
>>>> make
>>>>   * sure that:
>>>>   *
>>>>   *  a) drm_sched_fini() is only called after for all submitted jobs
>>>>   *     drm_sched_backend_ops.free_job() has been called or that
>>>>   *  b) the jobs for which drm_sched_backend_ops.free_job() has not
>>>> been
>>>> called
>>>>   *
>>>>   * FIXME: Take care of the above problem and prevent this function
>>>> from
>>>> leaking
>>>>   * the jobs in drm_gpu_scheduler.pending_list under any
>>>> circumstances.
>>>>
>>>> ^^^ recommended solution b).
>>>
>>> This has been introduced recently with commit baf4afc58314
>>> ("drm/sched: Improve
>>> teardown documentation") and I do not agree with this. The scheduler
>>> should
>>> *not* make any promises about implementation details to enable
>>> drivers to abuse
>>> their knowledge about component internals.
>>>
>>> This makes the problem *worse* as it encourages drivers to rely on
>>> implementation details, making maintainability of the scheduler even
>>> worse.
>>>
>>> For instance, what if I change the scheduler implementation, such
>>> that for every
>>> entry in the pending_list the scheduler allocates another internal
>>> object for
>>> ${something}? Then drivers would already fall apart leaking those
>>> internal
>>> objects.
>>>
>>> Now, obviously that's pretty unlikely, but I assume you get the idea.
>>>
>>> The b) paragraph in drm_sched_fini() should be removed for the given
>>> reasons.
>>>
>>> AFAICS, since the introduction of this commit, driver implementations
>>> haven't
>>> changed in this regard, hence we should be good.
>>>
>>> So, for me this doesn't change the fact that every driver
>>> implementation that
>>> just stops the scheduler at an arbitrary point of time and tries to
>>> clean things
>>> up manually relying on knowledge about component internals is broken.
>>
>> To elaborate on that, this documentation has been written so that we at
>> least have *some* documentation about the problem, instead of just
>> letting new drivers run into the knife.
>>
>> The commit explicitly introduced the FIXME, marking those two hacky
>> workarounds as undesirable.
>>
>> But back then we couldn't fix the problem quickly, so it was either
>> document the issue at least a bit, or leave it completely undocumented.
> 
> Agreed, but b) really sounds like an invitation (or even justification) for
> doing the wrong thing, let's removed it.

IMO it is better to leave it. Regardless of whether it was added because 
some driver is actually operating like that, it does describe a 
_currently_ workable option to avoid memory leaks. Once a better method 
is there, i.e. the FIXME is addressed, it can be removed or replaced.

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-23  7:34                                 ` Tvrtko Ursulin
@ 2025-04-23  8:48                                   ` Danilo Krummrich
  2025-04-23 10:10                                     ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-23  8:48 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Wed, Apr 23, 2025 at 08:34:08AM +0100, Tvrtko Ursulin wrote:
> 
> IMO it is better to leave it. Regardless of whether it was added because
> some driver is actually operating like that, it does describe a _currently_
> workable option to avoid memory leaks. Once a better method is there, ie.
> FIXME is addressed, then it can be removed or replaced.

I'm not willing to sign off on encouraging drivers to rely on scheduler
internals -- also not in this case, sorry.

Our primary goal with the scheduler is to *remove* such broken contracts where
drivers rely on scheduler internal implementation details, mess with scheduler
internal data structures etc. This is clearly a step back.

And AFAICT, as of now drivers either do a) or simply nothing (with the exception
of the mock scheduler). Drivers can do a) in the meantime; there's no reason at
all to additionally offer b).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-23  8:48                                   ` Danilo Krummrich
@ 2025-04-23 10:10                                     ` Tvrtko Ursulin
  2025-04-23 10:26                                       ` Danilo Krummrich
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2025-04-23 10:10 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel


On 23/04/2025 09:48, Danilo Krummrich wrote:
> On Wed, Apr 23, 2025 at 08:34:08AM +0100, Tvrtko Ursulin wrote:
>>
>> IMO it is better to leave it. Regardless of whether it was added because
>> some driver is actually operating like that, it does describe a _currently_
>> workable option to avoid memory leaks. Once a better method is there, ie.
>> FIXME is addressed, then it can be removed or replaced.
> 
> I'm not willing to sign off on encouraging drivers to rely on scheduler
> internals -- also not in this case, sorry.
> 
> Our primary goal with the scheduler is to *remove* such broken contracts where
> drivers rely on scheduler internal implementation details, mess with scheduler
> internal data structures etc. This is clearly a step back.
> 
> And AFAICT, as by now drivers either do a) or simply nothing (with the exception
> of the mock scheduler). Drivers can do a) in the meantime, there's no reason at
> all to additionally offer b).

What mechanism do we currently have to enable doing a), one which you 
would not consider to require knowledge of scheduler internals?

Regards,

Tvrtko


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
  2025-04-23 10:10                                     ` Tvrtko Ursulin
@ 2025-04-23 10:26                                       ` Danilo Krummrich
  0 siblings, 0 replies; 31+ messages in thread
From: Danilo Krummrich @ 2025-04-23 10:26 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: phasta, Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
	Christian König, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, dri-devel, nouveau, linux-kernel

On Wed, Apr 23, 2025 at 11:10:51AM +0100, Tvrtko Ursulin wrote:
> 
> On 23/04/2025 09:48, Danilo Krummrich wrote:
> > On Wed, Apr 23, 2025 at 08:34:08AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > IMO it is better to leave it. Regardless of whether it was added because
> > > some driver is actually operating like that, it does describe a _currently_
> > > workable option to avoid memory leaks. Once a better method is there, ie.
> > > FIXME is addressed, then it can be removed or replaced.
> > 
> > I'm not willing to sign off on encouraging drivers to rely on scheduler
> > internals -- also not in this case, sorry.
> > 
> > Our primary goal with the scheduler is to *remove* such broken contracts where
> > drivers rely on scheduler internal implementation details, mess with scheduler
> > internal data structures etc. This is clearly a step back.
> > 
> > And AFAICT, as by now drivers either do a) or simply nothing (with the exception
> > of the mock scheduler). Drivers can do a) in the meantime, there's no reason at
> > all to additionally offer b).
> 
> What mechanism do we currently have to enable using a), and which you would
> not consider needing knowledge of scheduler internals?

The rule is that drivers must not call drm_sched_fini() before all jobs are
processed.

For this, drivers may reference count their scheduler "subclass" for each job
in flight and only call drm_sched_fini() once the reference count becomes zero,
or keep their own list of jobs in flight and just wait for the list to become
empty.

This is possible with the regular API and none of this requires relying on
scheduler internals or messing with scheduler internal data structures.

The problem is limited to the fact that the GPU scheduler does not provide an
API for drivers to solve it.
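
To make option a) concrete, here is a minimal driver-side sketch of the
refcounting variant. All foo_* names are made up for illustration, and the
exact spot where a job takes its reference is up to the driver:

struct foo_sched {
	struct drm_gpu_scheduler base;
	struct kref refcount;		/* one initial ref + one per job in flight */
	struct completion idle;
};

static void foo_sched_init_refs(struct foo_sched *fs)
{
	kref_init(&fs->refcount);	/* the initial, driver-owned reference */
	init_completion(&fs->idle);
}

static void foo_sched_release(struct kref *kref)
{
	struct foo_sched *fs = container_of(kref, struct foo_sched, refcount);

	complete(&fs->idle);
}

/* Called once per submitted job, e.g. when the job is armed. */
static void foo_job_get_sched(struct foo_sched *fs)
{
	kref_get(&fs->refcount);
}

static void foo_free_job(struct drm_sched_job *sched_job)
{
	struct foo_sched *fs =
		container_of(sched_job->sched, struct foo_sched, base);

	/* ... free the driver's job object here ... */

	kref_put(&fs->refcount, foo_sched_release);
}

static void foo_sched_teardown(struct foo_sched *fs)
{
	/* Drop the initial ref and wait until free_job() ran for all jobs. */
	kref_put(&fs->refcount, foo_sched_release);
	wait_for_completion(&fs->idle);

	drm_sched_fini(&fs->base);
}

That way drm_sched_fini() only runs once free_job() has been called for
everything, which is exactly guideline a).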

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-04-23 10:26 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-07 15:22 [PATCH 0/5] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
2025-04-07 15:22 ` [PATCH 1/5] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
2025-04-17  7:49   ` Philipp Stanner
2025-04-07 15:22 ` [PATCH 2/5] drm/sched: Prevent teardown waitque from blocking too long Philipp Stanner
2025-04-07 15:22 ` [PATCH 3/5] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-04-17  7:45   ` Philipp Stanner
2025-04-17 11:27     ` Tvrtko Ursulin
2025-04-17 12:11       ` Danilo Krummrich
2025-04-17 14:20         ` Tvrtko Ursulin
2025-04-17 14:48           ` Danilo Krummrich
2025-04-17 16:08             ` Tvrtko Ursulin
2025-04-17 17:07               ` Danilo Krummrich
2025-04-22  6:06               ` Philipp Stanner
2025-04-22 10:39                 ` Tvrtko Ursulin
2025-04-22 11:13                   ` Danilo Krummrich
2025-04-22 12:00                     ` Philipp Stanner
2025-04-22 13:25                       ` Tvrtko Ursulin
2025-04-22 12:07                     ` Tvrtko Ursulin
2025-04-22 12:21                       ` Philipp Stanner
2025-04-22 12:32                       ` Danilo Krummrich
2025-04-22 13:39                         ` Tvrtko Ursulin
2025-04-22 13:46                           ` Philipp Stanner
2025-04-22 14:08                           ` Danilo Krummrich
2025-04-22 14:16                             ` Philipp Stanner
2025-04-22 14:52                               ` Danilo Krummrich
2025-04-23  7:34                                 ` Tvrtko Ursulin
2025-04-23  8:48                                   ` Danilo Krummrich
2025-04-23 10:10                                     ` Tvrtko Ursulin
2025-04-23 10:26                                       ` Danilo Krummrich
2025-04-07 15:22 ` [PATCH 4/5] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-04-07 15:22 ` [PATCH 5/5] drm/nouveau: Remove waitque for sched teardown Philipp Stanner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).