From: Philipp Stanner <phasta@mailbox.org>
To: "Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
"Philipp Stanner" <phasta@kernel.org>,
"Lyude Paul" <lyude@redhat.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Matthew Brost" <matthew.brost@intel.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
Date: Thu, 22 May 2025 16:59:35 +0200
Message-ID: <a637755cb96de8415b51feb1ae61b8c651e94295.camel@mailbox.org>
In-Reply-To: <b24d5c5e-8a9e-4dfb-886b-b3ad70e62e76@igalia.com>
On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote:
>
> On 22/05/2025 09:27, Philipp Stanner wrote:
> > The drm_gpu_scheduler now supports a callback to help
> > drm_sched_fini()
> > avoid memory leaks. This callback instructs the driver to signal
> > all
> > pending hardware fences.
> >
> > Implement the new callback
> > drm_sched_backend_ops.cancel_pending_fences().
> >
> > Have the callback use drm_mock_sched_job_complete() with a new
> > error
> > field for the fence error.
> >
> > Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there
> > is
> > no party for which checking for a CANCELED status would be useful
> > currently.
> >
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++------------
> >   drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
> >   2 files changed, 25 insertions(+), 46 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > index f999c8859cf7..eca47f0395bc 100644
> > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct drm_mock_sched_entity *entity)
> >  	drm_sched_entity_destroy(&entity->base);
> >  }
> > 
> > -static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> > +static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job, int err)
> >  {
> >  	struct drm_mock_scheduler *sched =
> >  		drm_sched_to_mock_sched(job->base.sched);
> > @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> >  	lockdep_assert_held(&sched->lock);
> > 
> >  	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> > -	list_move_tail(&job->link, &sched->done_list);
> > +	list_del(&job->link);
> > +	if (err)
> > +		dma_fence_set_error(&job->hw_fence, err);
> >  	dma_fence_signal(&job->hw_fence);
> >  	complete(&job->done);
> >  }
> > @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer *hrtimer)
> >  			break;
> > 
> >  		sched->hw_timeline.cur_seqno = job->hw_fence.seqno;
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >  	}
> >  	spin_unlock_irqrestore(&sched->lock, flags);
> > 
> > @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
> > 
> >  static void mock_sched_free_job(struct drm_sched_job *sched_job)
> >  {
> > -	struct drm_mock_scheduler *sched =
> > -			drm_sched_to_mock_sched(sched_job->sched);
> >  	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> > -	unsigned long flags;
> > 
> > -	/* Remove from the scheduler done list. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_del(&job->link);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> >  	dma_fence_put(&job->hw_fence);
> > -
> >  	drm_sched_job_cleanup(sched_job);
> > 
> >  	/* Mock job itself is freed by the kunit framework. */
> >  }
> > 
> > +static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *gsched)
>
> "gsched" feels like a first time invention. Maybe drm_sched?
Alright
>
> > +{
> > +	struct drm_mock_sched_job *job, *next;
> > +	struct drm_mock_scheduler *sched;
> > +	unsigned long flags;
> > +
> > +	sched = container_of(gsched, struct drm_mock_scheduler, base);
> > +
> > +	spin_lock_irqsave(&sched->lock, flags);
> > +	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > +		drm_mock_sched_job_complete(job, -ECANCELED);
> > +	spin_unlock_irqrestore(&sched->lock, flags);
>
> Canceling of the timers belongs in this call back I think. Otherwise
> jobs are not fully cancelled.
I wouldn't say so – the timers represent things like the hardware
interrupts, and those must be deactivated by the driver itself.

See, one big reason why I like my approach is that the contract between
driver and scheduler is made very simple:

"Driver, signal all fences that you ever returned through run_job() to
this scheduler!"

That always works, and the driver always has all those fences. It's
based on the most fundamental agreement regarding dma_fences: they must
all be signaled.
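
As a purely hypothetical driver-side sketch of that contract (the foo_*
names are invented for illustration; only dma_fence_set_error() and
dma_fence_signal() are the real API), the callback boils down to:

	static void foo_cancel_pending_fences(struct drm_gpu_scheduler *sched)
	{
		struct foo_device *fdev = to_foo_device(sched);
		struct foo_job *job;

		/* Quiesce the hardware first so no IRQ signals concurrently. */
		foo_hw_stop(fdev);

		/* Signal every fence ever returned through run_job(). */
		list_for_each_entry(job, &fdev->pending_jobs, node) {
			dma_fence_set_error(&job->hw_fence, -ECANCELED);
			dma_fence_signal(&job->hw_fence);
		}
	}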
>
> Hm, I also think, conceptually, the order of first canceling the
> timer
> and then signaling the fence should be kept.
That's the case here, no?

The order must indeed be kept, otherwise the timers could fire after
everything is torn down -> use-after-free.
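
Concretely, the order this patch establishes (condensed from the new
drm_mock_sched_fini() in the diff below) is:

	/* 1) Make sure no job timer can fire anymore. */
	list_for_each_entry(job, &sched->job_list, link)
		hrtimer_cancel(&job->timer);

	/* 2) Only then signal the remaining hw fences: drm_sched_fini()
	 *    calls back into mock_sched_cancel_pending_fences().
	 */
	drm_sched_fini(&sched->base);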
>
> At the moment it does not matter hugely, since the timer does not
> signal
> the jobs directly and will not find unlinked jobs, but if that
> changes
> in the future, the reversed order could cause double signaling. So if
> you keep it in the correct logical order that potential gotcha is
> avoided. Basically just keep the two pass approach verbatim, as is in
> the current drm_mock_sched_fini.
>
> The rest of the conversion is I think good.
:)
>
> Only a slight uncertainty after I cross-referenced with my version
> (->cancel_job()) around why I needed to add signaling to
> mock_sched_timedout_job() and manual job cleanup to the timeout test.
> It
> was more than a month ago that I wrote it so can't remember right
> now.
> You checked for memory leaks and the usual stuff?
Hm, it seems I indeed ran into the leak in the timeout tests that you
fixed (in addition to the other stuff) in your RFC.

We should probably fix that in a separate patch.
P.
>
> Regards,
>
> Tvrtko
>
> > +}
> > +
> >  static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
> >  	.run_job = mock_sched_run_job,
> >  	.timedout_job = mock_sched_timedout_job,
> > -	.free_job = mock_sched_free_job
> > +	.free_job = mock_sched_free_job,
> > +	.cancel_pending_fences = mock_sched_cancel_pending_fences,
> >  };
> > 
> >  /**
> > @@ -265,7 +274,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> >  	sched->hw_timeline.context = dma_fence_context_alloc(1);
> >  	atomic_set(&sched->hw_timeline.next_seqno, 0);
> >  	INIT_LIST_HEAD(&sched->job_list);
> > -	INIT_LIST_HEAD(&sched->done_list);
> >  	spin_lock_init(&sched->lock);
> > 
> >  	return sched;
> > @@ -280,38 +288,11 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> >   */
> >  void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> >  {
> > -	struct drm_mock_sched_job *job, *next;
> > -	unsigned long flags;
> > -	LIST_HEAD(list);
> > +	struct drm_mock_sched_job *job;
> > 
> > -	drm_sched_wqueue_stop(&sched->base);
> > -
> > -	/* Force complete all unfinished jobs. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry(job, &list, link)
> > +	list_for_each_entry(job, &sched->job_list, link)
> >  		hrtimer_cancel(&job->timer);
> > 
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		drm_mock_sched_job_complete(job);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	/*
> > -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
> > -	 * free worker.
> > -	 */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->done_list, link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		mock_sched_free_job(&job->base);
> > -
> >  	drm_sched_fini(&sched->base);
> >  }
> > 
> > @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
> >  		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
> >  			break;
> > 
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >  		found++;
> >  	}
> >  unlock:
> > diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > index 27caf8285fb7..22e530d87791 100644
> > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > @@ -32,9 +32,8 @@
> > *
> > * @base: DRM scheduler base class
> > * @test: Backpointer to owning the kunit test case
> > - * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
> > + * @lock: Lock to protect the simulated @hw_timeline and @job_list
> >    * @job_list: List of jobs submitted to the mock GPU
> > - * @done_list: List of jobs completed by the mock GPU
> >    * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
> >    *               @cur_seqno for implementing a struct dma_fence signaling the
> >    *               simulated job completion.
> > @@ -49,7 +48,6 @@ struct drm_mock_scheduler {
> >
> > spinlock_t lock;
> > struct list_head job_list;
> > - struct list_head done_list;
> >
> > struct {
> > u64 context;
>