[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

From: Philipp Stanner <phasta@kernel.org>
To: "Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Philipp Stanner" <phasta@kernel.org>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, Danilo Krummrich <dakr@redhat.com>
Subject: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
Date: Thu, 24 Apr 2025 11:55:32 +0200	[thread overview]
Message-ID: <20250424095535.26119-4-phasta@kernel.org> (raw)
In-Reply-To: <20250424095535.26119-2-phasta@kernel.org>

The waitqueue that ensures that drm_sched_fini() blocks until the
pending_list has become empty could theoretically cause that function to
block for a very long time. That, ultimately, could block userspace
procesess and prevent them from being killable through SIGKILL.

When a driver calls drm_sched_fini(), it is safe to assume that all
still pending jobs are not needed anymore anyways. Thus, they can be
cancelled and thereby it can be ensured that drm_sched_fini() will
return relatively quickly.

Implement a new helper to stop all work items / submission except for
the drm_sched_backend_ops.run_job().

Implement a driver callback, kill_fence_context(), that instructs the
driver to kill the fence context associated with this scheduler, thereby
causing all pending hardware fences to be signalled.

Call those new routines in drm_sched_fini() and ensure backwards
compatibility if the new callback is not implemented.

Suggested-by: Danilo Krummrich <dakr@redhat.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++---------
 include/drm/gpu_scheduler.h            | 11 ++++++
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ea82e69a72a8..c2ad6c70bfb6 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1390,31 +1390,46 @@ drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
 	return empty;
 }
 
+/**
+ * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
+ * @sched: scheduler instance
+ *
+ * Must only be called if &struct drm_sched_backend_ops.kill_fence_context is
+ * implemented.
+ *
+ * Instructs the driver to kill the fence context associated with this scheduler,
+ * thereby signalling all pending fences. This, in turn, will trigger
+ * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
+ * The function then blocks until all pending jobs have been freed.
+ */
+static inline void
+drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
+{
+	sched->ops->kill_fence_context(sched);
+	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
  * @sched: scheduler instance
  *
- * Tears down and cleans up the scheduler.
- *
- * Note that this function blocks until all the fences returned by
- * &struct drm_sched_backend_ops.run_job have been signalled.
+ * Tears down and cleans up the scheduler. Might leak memory if
+ * &struct drm_sched_backend_ops.kill_fence_context is not implemented.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	/*
-	 * Jobs that have neither been scheduled or which have timed out are
-	 * gone by now, but jobs that have been submitted through
-	 * backend_ops.run_job() and have not yet terminated are still pending.
-	 *
-	 * Wait for the pending_list to become empty to avoid leaking those jobs.
-	 */
-	drm_sched_submission_and_timeout_stop(sched);
-	wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
-	drm_sched_free_stop(sched);
+	if (sched->ops->kill_fence_context) {
+		drm_sched_submission_and_timeout_stop(sched);
+		drm_sched_cancel_jobs_and_wait(sched);
+		drm_sched_free_stop(sched);
+	} else {
+		/* We're in "legacy free-mode" and ignore potential mem leaks */
+		drm_sched_wqueue_stop(sched);
+	}
 
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
@@ -1502,7 +1517,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_ready);
 
 /**
- * drm_sched_wqueue_stop - stop scheduler submission
+ * drm_sched_wqueue_stop - stop scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Stops the scheduler from pulling new jobs from entities. It also stops
@@ -1518,7 +1533,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
 EXPORT_SYMBOL(drm_sched_wqueue_stop);
 
 /**
- * drm_sched_wqueue_start - start scheduler submission
+ * drm_sched_wqueue_start - start scheduler submission and freeing
  * @sched: scheduler instance
  *
  * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index d0b1f416b4d9..8630b4a26f10 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @kill_fence_context: kill the fence context belonging to this scheduler
+	 *
+	 * Needed to cleanly tear the scheduler down in drm_sched_fini(). This
+	 * callback will cause all hardware fences to be signalled by the driver,
+	 * which, ultimately, ensures that all jobs get freed before teardown.
+	 *
+	 * This callback is optional, but it is highly recommended to implement it.
+	 */
+	void (*kill_fence_context)(struct drm_gpu_scheduler *sched);
 };
 
 /**
-- 
2.48.1

next prev parent reply	other threads:[~2025-04-24  9:56 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24  9:55 [PATCH v2 0/6] drm/sched: Fix memory leaks in drm_sched_fini() Philipp Stanner
2025-04-24  9:55 ` [PATCH v2 1/6] drm/sched: Fix teardown leaks with waitqueue Philipp Stanner
2025-05-16  9:19   ` Tvrtko Ursulin
2025-04-24  9:55 ` Philipp Stanner [this message]
2025-05-16  9:33   ` [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long Tvrtko Ursulin
2025-05-16  9:53     ` Danilo Krummrich
2025-05-16 10:19       ` Tvrtko Ursulin
2025-05-16 10:54         ` Danilo Krummrich
2025-05-16 11:35           ` Tvrtko Ursulin
2025-05-16 12:00             ` Danilo Krummrich
2025-05-16 15:48               ` Tvrtko Ursulin
2025-05-16  9:54     ` Philipp Stanner
2025-05-16 10:34       ` Tvrtko Ursulin
2025-04-24  9:55 ` [PATCH v2 3/6] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-05-16  9:40   ` Tvrtko Ursulin
2025-04-24  9:55 ` [PATCH v2 4/6] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-04-24  9:55 ` [PATCH v2 5/6] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
2025-04-24  9:55 ` [PATCH v2 6/6] drm/sched: Port unit tests to new cleanup design Philipp Stanner
2025-05-08 11:03   ` Philipp Stanner
2025-05-08 12:51     ` Tvrtko Ursulin
2025-05-12  8:00       ` Philipp Stanner
2025-05-14  8:30         ` Tvrtko Ursulin
2025-05-14  9:19           ` Philipp Stanner
2025-05-16  9:00             ` Tvrtko Ursulin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:ea82e69a72a dfblob:c2ad6c70bfb dfblob:d0b1f416b4d
dfblob:8630b4a26f1 )
 OR (
bs:"[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250424095535.26119-4-phasta@kernel.org \
    --to=phasta@kernel.org \
    --cc=airlied@gmail.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=dakr@kernel.org \
    --cc=dakr@redhat.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lyude@redhat.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=matthew.brost@intel.com \
    --cc=mripard@kernel.org \
    --cc=nouveau@lists.freedesktop.org \
    --cc=simona@ffwll.ch \
    --cc=tvrtko.ursulin@igalia.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox