* [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback
@ 2025-07-07 13:42 Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
Changes in v2:
- Add new unit test to test cancel_job()'s behavior. (Tvrtko)
- Add RB from Maíra
Changes since the RFC:
- Rename helper function for drm_sched_fini(). (Tvrtko)
- Add Link to Tvrtko's RFC patch to patch 1.
For a long time now, drm_sched_fini() has been able to leak jobs that are
still in drm_sched.pending_list.
This series solves the leaks in a backwards-compatible manner by adding
a new, optional callback. If that callback is implemented, the scheduler
uses it to cancel all jobs from pending_list and then frees them.
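The contract the series establishes — cancel each pending job's hardware fence, then free the job — can be modeled in a self-contained userspace sketch. All names below are invented for illustration; the real code operates on struct drm_sched_job and dma_fence:

```c
#include <assert.h>
#include <stddef.h>

#define ECANCELED 125	/* matches the Linux errno value */

/* Stand-in for a job whose hardware fence may not have signaled yet. */
struct fake_job {
	struct fake_job *next;
	int fence_error;	/* 0 = fence not signaled yet */
	int freed;
};

/* Models the driver's cancel_job(): signal the fence with -ECANCELED. */
static void fake_cancel_job(struct fake_job *job)
{
	if (job->fence_error == 0)
		job->fence_error = -ECANCELED;
}

/* Models the driver's free_job(); here it only marks the job. */
static void fake_free_job(struct fake_job *job)
{
	job->freed = 1;
}

/*
 * Models the new teardown path in drm_sched_fini(): walk the pending
 * list, cancel each job, then free it.
 */
static void fake_sched_fini(struct fake_job *pending)
{
	struct fake_job *job, *tmp;

	for (job = pending; job; job = tmp) {
		tmp = job->next;
		fake_cancel_job(job);
		fake_free_job(job);
	}
}
```

The key property, mirrored from the scheduler change below: after teardown, every job that was still pending has a signaled fence carrying an error, and has been freed.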
Philipp Stanner (7):
drm/sched: Avoid memory leaks with cancel_job() callback
drm/sched/tests: Implement cancel_job() callback
drm/sched/tests: Add unit test for cancel_job()
drm/sched: Warn if pending list is not empty
drm/nouveau: Make fence container helper usable driver-wide
drm/nouveau: Add new callback for scheduler teardown
drm/nouveau: Remove waitque for sched teardown
drivers/gpu/drm/nouveau/nouveau_fence.c | 35 ++++++----
drivers/gpu/drm/nouveau/nouveau_fence.h | 7 ++
drivers/gpu/drm/nouveau/nouveau_sched.c | 35 ++++++----
drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +--
drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 +--
drivers/gpu/drm/scheduler/sched_main.c | 37 +++++++----
.../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
drivers/gpu/drm/scheduler/tests/tests_basic.c | 43 ++++++++++++
include/drm/gpu_scheduler.h | 18 +++++
9 files changed, 165 insertions(+), 93 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/7] drm/sched: Avoid memory leaks with cancel_job() callback
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media, Maíra Canal
Since its inception, the GPU scheduler has been able to leak memory if the
driver calls drm_sched_fini() while there are still jobs in flight.
The simplest way to solve this in a backwards compatible manner is by
adding a new callback, drm_sched_backend_ops.cancel_job(), which
instructs the driver to signal the hardware fence associated with the
job. Afterwards, the scheduler can safely use the established free_job()
callback for freeing the job.
Implement the new backend_ops callback cancel_job().
Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
include/drm/gpu_scheduler.h | 18 ++++++++++++++
2 files changed, 39 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c63543132f9d..1239954f5f7c 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1353,6 +1353,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
}
EXPORT_SYMBOL(drm_sched_init);
+static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
+{
+ struct drm_sched_job *job, *tmp;
+
+ /* All other accessors are stopped. No locking necessary. */
+ list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
+ sched->ops->cancel_job(job);
+ list_del(&job->list);
+ sched->ops->free_job(job);
+ }
+}
+
/**
* drm_sched_fini - Destroy a gpu scheduler
*
@@ -1360,19 +1372,11 @@ EXPORT_SYMBOL(drm_sched_init);
*
* Tears down and cleans up the scheduler.
*
- * This stops submission of new jobs to the hardware through
- * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
- * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
- * There is no solution for this currently. Thus, it is up to the driver to make
- * sure that:
- *
- * a) drm_sched_fini() is only called after for all submitted jobs
- * drm_sched_backend_ops.free_job() has been called or that
- * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
- * after drm_sched_fini() ran are freed manually.
- *
- * FIXME: Take care of the above problem and prevent this function from leaking
- * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
+ * This stops submission of new jobs to the hardware through &struct
+ * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
+ * is implemented, all jobs will be canceled through it and afterwards cleaned
+ * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
+ * implemented, memory could leak.
*/
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
@@ -1402,6 +1406,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
/* Confirm no work left behind accessing device structures */
cancel_delayed_work_sync(&sched->work_tdr);
+ /* Avoid memory leaks if supported by the driver. */
+ if (sched->ops->cancel_job)
+ drm_sched_cancel_remaining_jobs(sched);
+
if (sched->own_submit_wq)
destroy_workqueue(sched->submit_wq);
sched->ready = false;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e62a7214e052..190844370f48 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -512,6 +512,24 @@ struct drm_sched_backend_ops {
* and it's time to clean it up.
*/
void (*free_job)(struct drm_sched_job *sched_job);
+
+ /**
+ * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
+ * get signaled in drm_sched_fini().
+ *
+ * Used by the scheduler to cancel all jobs that have not been executed
+ * with &struct drm_sched_backend_ops.run_job by the time
+ * drm_sched_fini() gets invoked.
+ *
+ * Drivers need to signal the passed job's hardware fence with an
+ * appropriate error code (e.g., -ECANCELED) in this callback. They
+ * must not free the job.
+ *
+ * The scheduler will call this callback only after it has stopped
+ * calling all other callbacks for good, with the exception of &struct
+ * drm_sched_backend_ops.free_job.
+ */
+ void (*cancel_job)(struct drm_sched_job *sched_job);
};
/**
--
2.49.0
* [PATCH v2 2/7] drm/sched/tests: Implement cancel_job() callback
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 15:25 ` Tvrtko Ursulin
2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
` (4 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
The GPU scheduler now supports a new callback, cancel_job(), which lets
the scheduler cancel all jobs that might not yet have been freed when
drm_sched_fini() runs. Using this callback allows for significantly
simplifying the mock scheduler's teardown code.
Implement the cancel_job() callback and adjust the code where necessary.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
.../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
1 file changed, 23 insertions(+), 43 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index 49d067fecd67..2d3169d95200 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
lockdep_assert_held(&sched->lock);
job->flags |= DRM_MOCK_SCHED_JOB_DONE;
- list_move_tail(&job->link, &sched->done_list);
+ list_del(&job->link);
dma_fence_signal_locked(&job->hw_fence);
complete(&job->done);
}
@@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
static void mock_sched_free_job(struct drm_sched_job *sched_job)
{
- struct drm_mock_scheduler *sched =
- drm_sched_to_mock_sched(sched_job->sched);
struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
- unsigned long flags;
- /* Remove from the scheduler done list. */
- spin_lock_irqsave(&sched->lock, flags);
- list_del(&job->link);
- spin_unlock_irqrestore(&sched->lock, flags);
dma_fence_put(&job->hw_fence);
-
drm_sched_job_cleanup(sched_job);
/* Mock job itself is freed by the kunit framework. */
}
+static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+ struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
+ struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
+ unsigned long flags;
+
+ hrtimer_cancel(&job->timer);
+
+ spin_lock_irqsave(&sched->lock, flags);
+ if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
+ list_del(&job->link);
+ dma_fence_set_error(&job->hw_fence, -ECANCELED);
+ dma_fence_signal_locked(&job->hw_fence);
+ }
+ spin_unlock_irqrestore(&sched->lock, flags);
+
+ /*
+ * The GPU scheduler will still call drm_sched_backend_ops.free_job().
+ * Mock job itself is freed by the kunit framework.
+ */
+}
+
static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
.run_job = mock_sched_run_job,
.timedout_job = mock_sched_timedout_job,
- .free_job = mock_sched_free_job
+ .free_job = mock_sched_free_job,
+ .cancel_job = mock_sched_cancel_job,
};
/**
@@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
sched->hw_timeline.context = dma_fence_context_alloc(1);
atomic_set(&sched->hw_timeline.next_seqno, 0);
INIT_LIST_HEAD(&sched->job_list);
- INIT_LIST_HEAD(&sched->done_list);
spin_lock_init(&sched->lock);
return sched;
@@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
*/
void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
{
- struct drm_mock_sched_job *job, *next;
- unsigned long flags;
- LIST_HEAD(list);
-
- drm_sched_wqueue_stop(&sched->base);
-
- /* Force complete all unfinished jobs. */
- spin_lock_irqsave(&sched->lock, flags);
- list_for_each_entry_safe(job, next, &sched->job_list, link)
- list_move_tail(&job->link, &list);
- spin_unlock_irqrestore(&sched->lock, flags);
-
- list_for_each_entry(job, &list, link)
- hrtimer_cancel(&job->timer);
-
- spin_lock_irqsave(&sched->lock, flags);
- list_for_each_entry_safe(job, next, &list, link)
- drm_mock_sched_job_complete(job);
- spin_unlock_irqrestore(&sched->lock, flags);
-
- /*
- * Free completed jobs and jobs not yet processed by the DRM scheduler
- * free worker.
- */
- spin_lock_irqsave(&sched->lock, flags);
- list_for_each_entry_safe(job, next, &sched->done_list, link)
- list_move_tail(&job->link, &list);
- spin_unlock_irqrestore(&sched->lock, flags);
-
- list_for_each_entry_safe(job, next, &list, link)
- mock_sched_free_job(&job->base);
-
drm_sched_fini(&sched->base);
}
--
2.49.0
* [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job()
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 15:20 ` Tvrtko Ursulin
2025-07-07 13:42 ` [PATCH v2 4/7] drm/sched: Warn if pending list is not empty Philipp Stanner
` (3 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
The scheduler unit tests now implement the new callback, cancel_job(),
which drm_sched_fini() uses to cancel all jobs still pending.
Implement a new unit test to test this.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/tests/tests_basic.c | 43 +++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 7230057e0594..fa3da2db4893 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -204,6 +204,48 @@ static struct kunit_suite drm_sched_basic = {
.test_cases = drm_sched_basic_tests,
};
+static void drm_sched_basic_cancel(struct kunit *test)
+{
+ struct drm_mock_sched_entity *entity;
+ struct drm_mock_scheduler *sched;
+ struct drm_mock_sched_job *job;
+ bool done;
+
+ /*
+ * Check that drm_sched_fini() cancels all still-pending jobs via the
+ * cancel_job() callback.
+ */
+
+ sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
+ sched->base.credit_limit = 1;
+
+ entity = drm_mock_sched_entity_new(test, DRM_SCHED_PRIORITY_NORMAL,
+ sched);
+
+ job = drm_mock_sched_job_new(test, entity);
+
+ drm_mock_sched_job_submit(job);
+
+ done = drm_mock_sched_job_wait_scheduled(job, HZ);
+ KUNIT_ASSERT_TRUE(test, done);
+
+ drm_mock_sched_entity_free(entity);
+ drm_mock_sched_fini(sched);
+
+ KUNIT_ASSERT_EQ(test, job->hw_fence.error, -ECANCELED);
+}
+
+static struct kunit_case drm_sched_cancel_tests[] = {
+ KUNIT_CASE(drm_sched_basic_cancel),
+ {}
+};
+
+static struct kunit_suite drm_sched_cancel = {
+ .name = "drm_sched_basic_cancel_tests",
+ .init = drm_sched_basic_init,
+ .exit = drm_sched_basic_exit,
+ .test_cases = drm_sched_cancel_tests,
+};
+
static void drm_sched_basic_timeout(struct kunit *test)
{
struct drm_mock_scheduler *sched = test->priv;
@@ -471,6 +513,7 @@ static struct kunit_suite drm_sched_credits = {
kunit_test_suites(&drm_sched_basic,
&drm_sched_timeout,
+ &drm_sched_cancel,
&drm_sched_priority,
&drm_sched_modify_sched,
&drm_sched_credits);
--
2.49.0
* [PATCH v2 4/7] drm/sched: Warn if pending list is not empty
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
` (2 preceding siblings ...)
2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
drm_sched_fini() can leak jobs under certain circumstances.
Warn if that happens.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/sched_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1239954f5f7c..dadf1a22ddf6 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1415,6 +1415,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
sched->ready = false;
kfree(sched->sched_rq);
sched->sched_rq = NULL;
+
+ if (!list_empty(&sched->pending_list))
+ dev_err(sched->dev, "Tearing down scheduler while jobs are pending!\n");
}
EXPORT_SYMBOL(drm_sched_fini);
--
2.49.0
* [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
` (3 preceding siblings ...)
2025-07-07 13:42 ` [PATCH v2 4/7] drm/sched: Warn if pending list is not empty Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 13:45 ` Danilo Krummrich
2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
In order to implement a new DRM GPU scheduler callback in Nouveau, a
helper for obtaining a nouveau_fence from a dma_fence is necessary. Such
a helper exists already inside nouveau_fence.c, called from_fence().
Make that helper available to other C files with a more precise name.
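The helper is a thin container_of() wrapper. The pattern behind it can be demonstrated in a small userspace sketch (struct and function names below are invented for the demo; the real helper recovers a nouveau_fence from its embedded dma_fence):

```c
#include <assert.h>
#include <stddef.h>

/* Same definition the kernel uses, reduced to its core. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct base_fence {
	int seqno;
};

/* Plays the role of struct nouveau_fence, which embeds its base fence. */
struct wrapped_fence {
	int channel;
	struct base_fence base;
};

/* Analogous to to_nouveau_fence(): recover the wrapper from the member. */
static struct wrapped_fence *to_wrapped_fence(struct base_fence *f)
{
	return container_of(f, struct wrapped_fence, base);
}
```

The subtraction of offsetof() means this is valid only for fences actually embedded in the wrapping struct — the same constraint the nouveau helper relies on.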
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++++++-------------
drivers/gpu/drm/nouveau/nouveau_fence.h | 6 ++++++
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index d5654e26d5bc..869d4335c0f4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -38,12 +38,6 @@
static const struct dma_fence_ops nouveau_fence_ops_uevent;
static const struct dma_fence_ops nouveau_fence_ops_legacy;
-static inline struct nouveau_fence *
-from_fence(struct dma_fence *fence)
-{
- return container_of(fence, struct nouveau_fence, base);
-}
-
static inline struct nouveau_fence_chan *
nouveau_fctx(struct nouveau_fence *fence)
{
@@ -77,7 +71,7 @@ nouveau_local_fence(struct dma_fence *fence, struct nouveau_drm *drm)
fence->ops != &nouveau_fence_ops_uevent)
return NULL;
- return from_fence(fence);
+ return to_nouveau_fence(fence);
}
void
@@ -268,7 +262,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
static long
nouveau_fence_wait_legacy(struct dma_fence *f, bool intr, long wait)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
unsigned long sleep_time = NSEC_PER_MSEC / 1000;
unsigned long t = jiffies, timeout = t + wait;
@@ -448,7 +442,7 @@ static const char *nouveau_fence_get_get_driver_name(struct dma_fence *fence)
static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
return !fctx->dead ? fctx->name : "dead channel";
@@ -462,7 +456,7 @@ static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
*/
static bool nouveau_fence_is_signaled(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
struct nouveau_channel *chan;
bool ret = false;
@@ -478,7 +472,7 @@ static bool nouveau_fence_is_signaled(struct dma_fence *f)
static bool nouveau_fence_no_signaling(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
/*
* caller should have a reference on the fence,
@@ -503,7 +497,7 @@ static bool nouveau_fence_no_signaling(struct dma_fence *f)
static void nouveau_fence_release(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
kref_put(&fctx->fence_ref, nouveau_fence_context_put);
@@ -521,7 +515,7 @@ static const struct dma_fence_ops nouveau_fence_ops_legacy = {
static bool nouveau_fence_enable_signaling(struct dma_fence *f)
{
- struct nouveau_fence *fence = from_fence(f);
+ struct nouveau_fence *fence = to_nouveau_fence(f);
struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
bool ret;
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 6a983dd9f7b9..183dd43ecfff 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -17,6 +17,12 @@ struct nouveau_fence {
unsigned long timeout;
};
+static inline struct nouveau_fence *
+to_nouveau_fence(struct dma_fence *fence)
+{
+ return container_of(fence, struct nouveau_fence, base);
+}
+
int nouveau_fence_create(struct nouveau_fence **, struct nouveau_channel *);
int nouveau_fence_new(struct nouveau_fence **, struct nouveau_channel *);
void nouveau_fence_unref(struct nouveau_fence **);
--
2.49.0
* [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
` (4 preceding siblings ...)
2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 14:26 ` Danilo Krummrich
2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
The GPU scheduler now provides a new callback which allows for always
tearing the scheduler down in a leak-free, deadlock-free manner.
Port Nouveau as its first user by providing the scheduler with a
callback that ensures the fence context gets killed in drm_sched_fini().
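The core of nouveau_fence_cancel() below is the rule "set the error, then signal, but only if the fence has not signaled yet". A userspace sketch of that rule, with invented names standing in for the dma_fence machinery:

```c
#include <assert.h>

#define ECANCELED 125	/* matches the Linux errno value */

struct model_fence {
	int signaled;
	int error;
};

/*
 * Models the cancel path: the error must be set before signaling, and
 * an already-signaled fence keeps its original status untouched.
 */
static void model_cancel(struct model_fence *f)
{
	if (!f->signaled) {
		f->error = -ECANCELED;
		f->signaled = 1;
	}
}
```

This mirrors why the real code takes the fence context lock and checks dma_fence_is_signaled_locked() first: cancellation must never overwrite the status of a fence the hardware already completed.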
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 15 +++++++++++++++
drivers/gpu/drm/nouveau/nouveau_fence.h | 1 +
drivers/gpu/drm/nouveau/nouveau_sched.c | 15 ++++++++++++++-
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 869d4335c0f4..9f345a008717 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -240,6 +240,21 @@ nouveau_fence_emit(struct nouveau_fence *fence)
return ret;
}
+void
+nouveau_fence_cancel(struct nouveau_fence *fence)
+{
+ struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
+ unsigned long flags;
+
+ spin_lock_irqsave(&fctx->lock, flags);
+ if (!dma_fence_is_signaled_locked(&fence->base)) {
+ dma_fence_set_error(&fence->base, -ECANCELED);
+ if (nouveau_fence_signal(fence))
+ nvif_event_block(&fctx->event);
+ }
+ spin_unlock_irqrestore(&fctx->lock, flags);
+}
+
bool
nouveau_fence_done(struct nouveau_fence *fence)
{
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 183dd43ecfff..9957a919bd38 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -29,6 +29,7 @@ void nouveau_fence_unref(struct nouveau_fence **);
int nouveau_fence_emit(struct nouveau_fence *);
bool nouveau_fence_done(struct nouveau_fence *);
+void nouveau_fence_cancel(struct nouveau_fence *fence);
int nouveau_fence_wait(struct nouveau_fence *, bool lazy, bool intr);
int nouveau_fence_sync(struct nouveau_bo *, struct nouveau_channel *, bool exclusive, bool intr);
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 460a5fb02412..2ec62059c351 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -11,6 +11,7 @@
#include "nouveau_exec.h"
#include "nouveau_abi16.h"
#include "nouveau_sched.h"
+#include "nouveau_chan.h"
#define NOUVEAU_SCHED_JOB_TIMEOUT_MS 10000
@@ -393,10 +394,23 @@ nouveau_sched_free_job(struct drm_sched_job *sched_job)
nouveau_job_fini(job);
}
+static void
+nouveau_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+ struct nouveau_fence *fence;
+ struct nouveau_job *job;
+
+ job = to_nouveau_job(sched_job);
+ fence = to_nouveau_fence(job->done_fence);
+
+ nouveau_fence_cancel(fence);
+}
+
static const struct drm_sched_backend_ops nouveau_sched_ops = {
.run_job = nouveau_sched_run_job,
.timedout_job = nouveau_sched_timedout_job,
.free_job = nouveau_sched_free_job,
+ .cancel_job = nouveau_sched_cancel_job,
};
static int
@@ -482,7 +496,6 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
return 0;
}
-
static void
nouveau_sched_fini(struct nouveau_sched *sched)
{
--
2.49.0
* [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
` (5 preceding siblings ...)
2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
@ 2025-07-07 13:42 ` Philipp Stanner
2025-07-07 13:47 ` Danilo Krummrich
6 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:42 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Matthew Brost, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
struct nouveau_sched contains a waitque needed to prevent
drm_sched_fini() from being called while there are still jobs pending.
Doing so, until now, would have caused memory leaks.

With the new memleak-free mode of operation, switched on in
drm_sched_fini() by providing the cancel_job() callback,
nouveau_sched_cancel_job(), the waitque is no longer necessary.
Remove the waitque.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++++++-------------
drivers/gpu/drm/nouveau/nouveau_sched.h | 9 +++------
drivers/gpu/drm/nouveau/nouveau_uvmm.c | 8 ++++----
3 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 2ec62059c351..7d9c3418e76b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -122,11 +122,9 @@ nouveau_job_done(struct nouveau_job *job)
{
struct nouveau_sched *sched = job->sched;
- spin_lock(&sched->job.list.lock);
+ spin_lock(&sched->job_list.lock);
list_del(&job->entry);
- spin_unlock(&sched->job.list.lock);
-
- wake_up(&sched->job.wq);
+ spin_unlock(&sched->job_list.lock);
}
void
@@ -307,9 +305,9 @@ nouveau_job_submit(struct nouveau_job *job)
}
/* Submit was successful; add the job to the schedulers job list. */
- spin_lock(&sched->job.list.lock);
- list_add(&job->entry, &sched->job.list.head);
- spin_unlock(&sched->job.list.lock);
+ spin_lock(&sched->job_list.lock);
+ list_add(&job->entry, &sched->job_list.head);
+ spin_unlock(&sched->job_list.lock);
drm_sched_job_arm(&job->base);
job->done_fence = dma_fence_get(&job->base.s_fence->finished);
@@ -460,9 +458,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
goto fail_sched;
mutex_init(&sched->mutex);
- spin_lock_init(&sched->job.list.lock);
- INIT_LIST_HEAD(&sched->job.list.head);
- init_waitqueue_head(&sched->job.wq);
+ spin_lock_init(&sched->job_list.lock);
+ INIT_LIST_HEAD(&sched->job_list.head);
return 0;
@@ -502,9 +499,6 @@ nouveau_sched_fini(struct nouveau_sched *sched)
struct drm_gpu_scheduler *drm_sched = &sched->base;
struct drm_sched_entity *entity = &sched->entity;
- rmb(); /* for list_empty to work without lock */
- wait_event(sched->job.wq, list_empty(&sched->job.list.head));
-
drm_sched_entity_fini(entity);
drm_sched_fini(drm_sched);
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index 20cd1da8db73..b98c3f0bef30 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -103,12 +103,9 @@ struct nouveau_sched {
struct mutex mutex;
struct {
- struct {
- struct list_head head;
- spinlock_t lock;
- } list;
- struct wait_queue_head wq;
- } job;
+ struct list_head head;
+ spinlock_t lock;
+ } job_list;
};
int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index 48f105239f42..ddfc46bc1b3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1019,8 +1019,8 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
u64 end = addr + range;
again:
- spin_lock(&sched->job.list.lock);
- list_for_each_entry(__job, &sched->job.list.head, entry) {
+ spin_lock(&sched->job_list.lock);
+ list_for_each_entry(__job, &sched->job_list.head, entry) {
struct nouveau_uvmm_bind_job *bind_job = to_uvmm_bind_job(__job);
list_for_each_op(op, &bind_job->ops) {
@@ -1030,7 +1030,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
if (!(end <= op_addr || addr >= op_end)) {
nouveau_uvmm_bind_job_get(bind_job);
- spin_unlock(&sched->job.list.lock);
+ spin_unlock(&sched->job_list.lock);
wait_for_completion(&bind_job->complete);
nouveau_uvmm_bind_job_put(bind_job);
goto again;
@@ -1038,7 +1038,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
}
}
}
- spin_unlock(&sched->job.list.lock);
+ spin_unlock(&sched->job_list.lock);
}
static int
--
2.49.0
* Re: [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide
2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
@ 2025-07-07 13:45 ` Danilo Krummrich
0 siblings, 0 replies; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 13:45 UTC (permalink / raw)
To: Philipp Stanner
Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
linux-media
On 7/7/25 3:42 PM, Philipp Stanner wrote:
> In order to implement a new DRM GPU scheduler callback in Nouveau, a
> helper for obtaining a nouveau_fence from a dma_fence is necessary. Such
> a helper exists already inside nouveau_fence.c, called from_fence().
>
> Make that helper available to other C files with a more precise name.
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
Acked-by: Danilo Krummrich <dakr@kernel.org>
* Re: [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
@ 2025-07-07 13:47 ` Danilo Krummrich
2025-07-07 13:54 ` Philipp Stanner
0 siblings, 1 reply; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 13:47 UTC (permalink / raw)
To: Philipp Stanner
Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
linux-media
On 7/7/25 3:42 PM, Philipp Stanner wrote:
> struct nouveau_sched contains a waitque needed to prevent
> drm_sched_fini() from being called while there are still jobs pending.
> Doing so thus far would have caused memory leaks.
>
> With the new memleak-free mode of operation switched on in
> drm_sched_fini() by providing the callback
> nouveau_sched_fence_context_kill() the waitque is not necessary anymore.
>
> Remove the waitque.
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
Doesn't this break the driver until fixed up by the subsequent patch?
* Re: [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
2025-07-07 13:47 ` Danilo Krummrich
@ 2025-07-07 13:54 ` Philipp Stanner
2025-07-07 14:26 ` Danilo Krummrich
0 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-07 13:54 UTC (permalink / raw)
To: Danilo Krummrich, Philipp Stanner
Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
linux-media
On Mon, 2025-07-07 at 15:47 +0200, Danilo Krummrich wrote:
> On 7/7/25 3:42 PM, Philipp Stanner wrote:
> > struct nouveau_sched contains a waitque needed to prevent
> > drm_sched_fini() from being called while there are still jobs
> > pending.
> > Doing so thus far would have caused memory leaks.
> >
> > With the new memleak-free mode of operation switched on in
> > drm_sched_fini() by providing the callback
> > nouveau_sched_fence_context_kill() the waitque is not necessary
> > anymore.
> >
> > Remove the waitque.
> >
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
>
> Doesn't this break the driver until fixed up by the subsequent patch?
>
Did you mean to reply to patch 6?
Patch 6 implements the cancel_job() callback for nouveau, which makes
sure the (still existing) waitque will never block. The now-redundant
waitque then gets removed in patch 7.
P.
* Re: [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown
2025-07-07 13:54 ` Philipp Stanner
@ 2025-07-07 14:26 ` Danilo Krummrich
0 siblings, 0 replies; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 14:26 UTC (permalink / raw)
To: phasta
Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
linux-media
On 7/7/25 3:54 PM, Philipp Stanner wrote:
> On Mon, 2025-07-07 at 15:47 +0200, Danilo Krummrich wrote:
>> On 7/7/25 3:42 PM, Philipp Stanner wrote:
>>> struct nouveau_sched contains a waitque needed to prevent
>>> drm_sched_fini() from being called while there are still jobs
>>> pending.
>>> Doing so thus far would have caused memory leaks.
>>>
>>> With the new memleak-free mode of operation switched on in
>>> drm_sched_fini() by providing the callback
>>> nouveau_sched_fence_context_kill() the waitque is not necessary
>>> anymore.
>>>
>>> Remove the waitque.
>>>
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>
>> Doesn't this break the driver until fixed up by the subsequent patch?
>>
>
> Did you mean to reply to patch 6?
>
> Patch 6 implements the cancel_job() callback for nouveau, which makes
> sure the (still existing) waitque will never block. The now-redundant
> waitque then gets removed in patch 7.
Yup, I mixed up the order on my end.
Acked-by: Danilo Krummrich <dakr@kernel.org>
* Re: [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown
2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
@ 2025-07-07 14:26 ` Danilo Krummrich
0 siblings, 0 replies; 15+ messages in thread
From: Danilo Krummrich @ 2025-07-07 14:26 UTC (permalink / raw)
To: Philipp Stanner
Cc: Lyude Paul, David Airlie, Simona Vetter, Matthew Brost,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Sumit Semwal, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer, dri-devel, nouveau, linux-kernel,
linux-media
On 7/7/25 3:42 PM, Philipp Stanner wrote:
> There is a new callback for always tearing the scheduler down in a
> leak-free, deadlock-free manner.
>
> Port Nouveau as its first user by providing the scheduler with a
> callback that ensures the fence context gets killed in drm_sched_fini().
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
Acked-by: Danilo Krummrich <dakr@kernel.org>
* Re: [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job()
2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
@ 2025-07-07 15:20 ` Tvrtko Ursulin
0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2025-07-07 15:20 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
On 07/07/2025 14:42, Philipp Stanner wrote:
> The scheduler unit tests now provide a new callback, cancel_job(). This
> callback gets used by drm_sched_fini() for all still pending jobs to
> cancel them.
>
> Implement a new unit test to test this.
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/scheduler/tests/tests_basic.c | 43 +++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> index 7230057e0594..fa3da2db4893 100644
> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> @@ -204,6 +204,48 @@ static struct kunit_suite drm_sched_basic = {
> .test_cases = drm_sched_basic_tests,
> };
>
> +static void drm_sched_basic_cancel(struct kunit *test)
> +{
> + struct drm_mock_sched_entity *entity;
> + struct drm_mock_scheduler *sched;
> + struct drm_mock_sched_job *job;
> + bool done;
> +
> + /*
> + * Check that the configured credit limit is respected.
> + */
Copy & paste mishap.
> +
> + sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> + sched->base.credit_limit = 1;
Ditto.
> +
> + entity = drm_mock_sched_entity_new(test, DRM_SCHED_PRIORITY_NORMAL,
> + sched);
> +
> + job = drm_mock_sched_job_new(test, entity);
> +
> + drm_mock_sched_job_submit(job);
> +
> + done = drm_mock_sched_job_wait_scheduled(job, HZ);
> + KUNIT_ASSERT_TRUE(test, done);
> +
> + drm_mock_sched_entity_free(entity);
> + drm_mock_sched_fini(sched);
> +
> + KUNIT_ASSERT_EQ(test, job->hw_fence.error, -ECANCELED);
> +}
> +
> +static struct kunit_case drm_sched_cancel_tests[] = {
> + KUNIT_CASE(drm_sched_basic_cancel),
> + {}
> +};
> +
> +static struct kunit_suite drm_sched_cancel = {
> + .name = "drm_sched_basic_cancel_tests",
> + .init = drm_sched_basic_init,
> + .exit = drm_sched_basic_exit,
> + .test_cases = drm_sched_cancel_tests,
> +};
> +
> static void drm_sched_basic_timeout(struct kunit *test)
> {
> struct drm_mock_scheduler *sched = test->priv;
> @@ -471,6 +513,7 @@ static struct kunit_suite drm_sched_credits = {
>
> kunit_test_suites(&drm_sched_basic,
> &drm_sched_timeout,
> + &drm_sched_cancel,
> &drm_sched_priority,
> &drm_sched_modify_sched,
> &drm_sched_credits);
The rest looks good. With the comment fixed and credit limit setting
removed:
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Regards,
Tvrtko
* Re: [PATCH v2 2/7] drm/sched/tests: Implement cancel_job() callback
2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
@ 2025-07-07 15:25 ` Tvrtko Ursulin
0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2025-07-07 15:25 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Matthew Brost, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Sumit Semwal,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, nouveau, linux-kernel, linux-media
On 07/07/2025 14:42, Philipp Stanner wrote:
> The GPU Scheduler now supports a new callback, cancel_job(), which lets
> the scheduler cancel all jobs which might not yet be freed when
> drm_sched_fini() runs. Using this callback allows for significantly
> simplifying the mock scheduler teardown code.
>
> Implement the cancel_job() callback and adjust the code where necessary.
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> .../gpu/drm/scheduler/tests/mock_scheduler.c | 66 +++++++------------
> 1 file changed, 23 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index 49d067fecd67..2d3169d95200 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -63,7 +63,7 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> lockdep_assert_held(&sched->lock);
>
> job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> - list_move_tail(&job->link, &sched->done_list);
> + list_del(&job->link);
> dma_fence_signal_locked(&job->hw_fence);
> complete(&job->done);
> }
> @@ -236,26 +236,39 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
>
> static void mock_sched_free_job(struct drm_sched_job *sched_job)
> {
> - struct drm_mock_scheduler *sched =
> - drm_sched_to_mock_sched(sched_job->sched);
> struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> - unsigned long flags;
>
> - /* Remove from the scheduler done list. */
> - spin_lock_irqsave(&sched->lock, flags);
> - list_del(&job->link);
> - spin_unlock_irqrestore(&sched->lock, flags);
> dma_fence_put(&job->hw_fence);
> -
> drm_sched_job_cleanup(sched_job);
>
> /* Mock job itself is freed by the kunit framework. */
> }
>
> +static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
> +{
> + struct drm_mock_scheduler *sched = drm_sched_to_mock_sched(sched_job->sched);
> + struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> + unsigned long flags;
> +
> + hrtimer_cancel(&job->timer);
> +
> + spin_lock_irqsave(&sched->lock, flags);
> + if (!dma_fence_is_signaled_locked(&job->hw_fence)) {
> + list_del(&job->link);
> + dma_fence_set_error(&job->hw_fence, -ECANCELED);
> + dma_fence_signal_locked(&job->hw_fence);
> + }
> + spin_unlock_irqrestore(&sched->lock, flags);
> +
> + /* The GPU Scheduler will call drm_sched_backend_ops.free_job(), still.
> + * Mock job itself is freed by the kunit framework. */
I suggest sticking with the multi-line comment format as used in this
file. Ie.
/*
* Multi-line comment.
*/
> +}
> +
> static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
> .run_job = mock_sched_run_job,
> .timedout_job = mock_sched_timedout_job,
> - .free_job = mock_sched_free_job
> + .free_job = mock_sched_free_job,
> + .cancel_job = mock_sched_cancel_job,
> };
>
> /**
> @@ -289,7 +302,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> sched->hw_timeline.context = dma_fence_context_alloc(1);
> atomic_set(&sched->hw_timeline.next_seqno, 0);
> INIT_LIST_HEAD(&sched->job_list);
> - INIT_LIST_HEAD(&sched->done_list);
You forgot to remove done_list from the struct definition.
Regards,
Tvrtko
> spin_lock_init(&sched->lock);
>
> return sched;
> @@ -304,38 +316,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> */
> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> {
> - struct drm_mock_sched_job *job, *next;
> - unsigned long flags;
> - LIST_HEAD(list);
> -
> - drm_sched_wqueue_stop(&sched->base);
> -
> - /* Force complete all unfinished jobs. */
> - spin_lock_irqsave(&sched->lock, flags);
> - list_for_each_entry_safe(job, next, &sched->job_list, link)
> - list_move_tail(&job->link, &list);
> - spin_unlock_irqrestore(&sched->lock, flags);
> -
> - list_for_each_entry(job, &list, link)
> - hrtimer_cancel(&job->timer);
> -
> - spin_lock_irqsave(&sched->lock, flags);
> - list_for_each_entry_safe(job, next, &list, link)
> - drm_mock_sched_job_complete(job);
> - spin_unlock_irqrestore(&sched->lock, flags);
> -
> - /*
> - * Free completed jobs and jobs not yet processed by the DRM scheduler
> - * free worker.
> - */
> - spin_lock_irqsave(&sched->lock, flags);
> - list_for_each_entry_safe(job, next, &sched->done_list, link)
> - list_move_tail(&job->link, &list);
> - spin_unlock_irqrestore(&sched->lock, flags);
> -
> - list_for_each_entry_safe(job, next, &list, link)
> - mock_sched_free_job(&job->base);
> -
> drm_sched_fini(&sched->base);
> }
>
end of thread (newest message: 2025-07-07 15:25 UTC)
Thread overview: 15+ messages
2025-07-07 13:42 [PATCH v2 0/7] drm/sched: Fix memory leaks with cancel_job() callback Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 1/7] drm/sched: Avoid " Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 2/7] drm/sched/tests: Implement " Philipp Stanner
2025-07-07 15:25 ` Tvrtko Ursulin
2025-07-07 13:42 ` [PATCH v2 3/7] drm/sched/tests: Add unit test for cancel_job() Philipp Stanner
2025-07-07 15:20 ` Tvrtko Ursulin
2025-07-07 13:42 ` [PATCH v2 4/7] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-07-07 13:42 ` [PATCH v2 5/7] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
2025-07-07 13:45 ` Danilo Krummrich
2025-07-07 13:42 ` [PATCH v2 6/7] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-07-07 14:26 ` Danilo Krummrich
2025-07-07 13:42 ` [PATCH v2 7/7] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
2025-07-07 13:47 ` Danilo Krummrich
2025-07-07 13:54 ` Philipp Stanner
2025-07-07 14:26 ` Danilo Krummrich