From: Philipp Stanner <phasta@mailbox.org>
To: "Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
phasta@kernel.org, "Lyude Paul" <lyude@redhat.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Matthew Brost" <matthew.brost@intel.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Pierre-Eric Pelloux-Prayer" <pierre-eric.pelloux-prayer@amd.com>
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-media@vger.kernel.org
Subject: Re: [RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
Date: Mon, 16 Jun 2025 12:08:40 +0200 [thread overview]
Message-ID: <d1ecec0124edcf70f682e91e52f3f349c7a1b33c.camel@mailbox.org> (raw)
In-Reply-To: <18cd6b1f-8872-4a16-9ceb-50fd1ecfea39@igalia.com>
On Mon, 2025-06-16 at 10:27 +0100, Tvrtko Ursulin wrote:
>
> On 12/06/2025 15:20, Philipp Stanner wrote:
> > On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote:
> > >
> > > On 03/06/2025 10:31, Philipp Stanner wrote:
> > > > Since its inception, the GPU scheduler can leak memory if the driver
> > > > calls drm_sched_fini() while there are still jobs in flight.
> > > >
> > > > The simplest way to solve this in a backwards compatible manner is by
> > > > adding a new callback, drm_sched_backend_ops.cancel_job(), which
> > > > instructs the driver to signal the hardware fence associated with the
> > > > job. Afterwards, the scheduler can safely use the established
> > > > free_job() callback for freeing the job.
> > > >
> > > > Implement the new backend_ops callback cancel_job().
> > > >
> > > > Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > >
> > > Please just add the link to the patch here (it is only in the cover
> > > letter):
> > >
> > > Link:
> > > https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
> >
> > That I can do, sure
>
> Cool, with that, for this patch:
>
> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> > > And you probably want to take the unit test modifications from the same
> > > patch too. You could put them in the same patch or separate.
> >
> > Necessary adjustments for the unit tests are already implemented and
> > are waiting for review separately, since this can be done independently
> > from this entire series:
> >
> > https://lore.kernel.org/dri-devel/20250605134154.191764-2-phasta@kernel.org/
>
> For me it would make most sense to fold that into 2/6 from this series.
> I don't see it making sense as standalone. So if you could repost the
> series with it integrated I will give it a spin and can review that
> patch at least.
It does make sense as an independent patch, because it is independent:
it improves the unit tests so that they become a better role model for
the driver callbacks. All fences must always get signaled, which is
currently not the case there. The unit tests serve as a reference
implementation for new users, which is why I am stressing that point.
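To make the invariant concrete (purely illustrative; the names below are
made up and do not match the actual scheduler structures): during teardown,
every job still on the pending list must first have its fence canceled,
and only then be freed, each exactly once. A minimal userspace model:

```c
#include <assert.h>
#include <stdlib.h>

/* Mock stand-ins for the scheduler and its jobs. */
struct mock_job {
	struct mock_job *next;
};

struct mock_sched {
	struct mock_job *pending;	/* pending list (singly linked here) */
	int canceled;
	int freed;
};

/* Models cancel_job(): signal the hardware fence with -ECANCELED
 * (represented by a counter here); must NOT free the job. */
static void mock_cancel_job(struct mock_sched *s, struct mock_job *j)
{
	(void)j;
	s->canceled++;
}

/* Models free_job(): the established cleanup callback. */
static void mock_free_job(struct mock_sched *s, struct mock_job *j)
{
	free(j);
	s->freed++;
}

/* Models drm_sched_kill_remaining_jobs(): for every job still on the
 * pending list, cancel first, then unlink and free. */
static void mock_kill_remaining_jobs(struct mock_sched *s)
{
	struct mock_job *j = s->pending;

	while (j) {
		struct mock_job *next = j->next;

		mock_cancel_job(s, j);
		mock_free_job(s, j);
		j = next;
	}
	s->pending = NULL;
}
```

After running mock_kill_remaining_jobs() on a scheduler with N pending
jobs, both counters equal N and the pending list is empty; that is the
property the unit tests should demonstrate for new users.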
If you disagree with that patch's content, please reply to it directly.

P.
>
> Regards,
>
> Tvrtko
>
> >
> > Thx
> > P.
> >
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
> > > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
> > > >  include/drm/gpu_scheduler.h            |  9 +++++++
> > > >  2 files changed, 30 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index d20726d7adf0..3f14f1e151fa 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -1352,6 +1352,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > > >  }
> > > >  EXPORT_SYMBOL(drm_sched_init);
> > > > 
> > > > +static void drm_sched_kill_remaining_jobs(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	struct drm_sched_job *job, *tmp;
> > > > +
> > > > +	/* All other accessors are stopped. No locking necessary. */
> > > > +	list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
> > > > +		sched->ops->cancel_job(job);
> > > > +		list_del(&job->list);
> > > > +		sched->ops->free_job(job);
> > > > +	}
> > > > +}
> > > > +
> > > > +
> > > > /**
> > > > * drm_sched_fini - Destroy a gpu scheduler
> > > > *
> > > > @@ -1359,19 +1371,11 @@ EXPORT_SYMBOL(drm_sched_init);
> > > > *
> > > > * Tears down and cleans up the scheduler.
> > > > *
> > > > - * This stops submission of new jobs to the hardware through
> > > > - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
> > > > - * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
> > > > - * There is no solution for this currently. Thus, it is up to the driver to make
> > > > - * sure that:
> > > > - *
> > > > - * a) drm_sched_fini() is only called after for all submitted jobs
> > > > - *    drm_sched_backend_ops.free_job() has been called or that
> > > > - * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
> > > > - *    after drm_sched_fini() ran are freed manually.
> > > > - *
> > > > - * FIXME: Take care of the above problem and prevent this function from leaking
> > > > - * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> > > > + * This stops submission of new jobs to the hardware through &struct
> > > > + * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
> > > > + * is implemented, all jobs will be canceled through it and afterwards cleaned
> > > > + * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
> > > > + * implemented, memory could leak.
> > > > */
> > > > void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > > > {
> > > > @@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > > >  	/* Confirm no work left behind accessing device structures */
> > > >  	cancel_delayed_work_sync(&sched->work_tdr);
> > > > 
> > > > +	/* Avoid memory leaks if supported by the driver. */
> > > > +	if (sched->ops->cancel_job)
> > > > +		drm_sched_kill_remaining_jobs(sched);
> > > > +
> > > >  	if (sched->own_submit_wq)
> > > >  		destroy_workqueue(sched->submit_wq);
> > > >  	sched->ready = false;
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index e62a7214e052..81dcbfc8c223 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -512,6 +512,15 @@ struct drm_sched_backend_ops {
> > > >  	 * and it's time to clean it up.
> > > >  	 */
> > > >  	void (*free_job)(struct drm_sched_job *sched_job);
> > > > +
> > > > +	/**
> > > > +	 * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
> > > > +	 * get signaled in drm_sched_fini().
> > > > +	 *
> > > > +	 * Drivers need to signal the passed job's hardware fence with
> > > > +	 * -ECANCELED in this callback. They must not free the job.
> > > > +	 */
> > > > +	void (*cancel_job)(struct drm_sched_job *sched_job);
> > > > };
> > > >
> > > > /**
> > >
> >
>
Thread overview: 16+ messages
2025-06-03 9:31 [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job Philipp Stanner
2025-06-03 9:31 ` [RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback Philipp Stanner
2025-06-03 13:22 ` Tvrtko Ursulin
2025-06-12 14:17 ` Tvrtko Ursulin
2025-06-12 14:20 ` Philipp Stanner
2025-06-16 9:27 ` Tvrtko Ursulin
2025-06-16 10:08 ` Philipp Stanner [this message]
2025-06-03 9:31 ` [RFC PATCH 2/6] drm/sched/tests: Implement cancel_job() Philipp Stanner
2025-06-03 9:31 ` [RFC PATCH 3/6] drm/sched: Warn if pending list is not empty Philipp Stanner
2025-06-03 9:31 ` [RFC PATCH 4/6] drm/nouveau: Make fence container helper usable driver-wide Philipp Stanner
2025-06-03 9:31 ` [RFC PATCH 5/6] drm/nouveau: Add new callback for scheduler teardown Philipp Stanner
2025-06-03 9:31 ` [RFC PATCH 6/6] drm/nouveau: Remove waitque for sched teardown Philipp Stanner
2025-06-03 12:27 ` [RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job Tvrtko Ursulin
2025-06-03 13:23 ` Philipp Stanner
2025-06-11 21:21 ` Danilo Krummrich
2025-06-12 13:46 ` Tvrtko Ursulin