From: Philipp Stanner <phasta@mailbox.org>
To: "Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
"Philipp Stanner" <phasta@kernel.org>,
"Lyude Paul" <lyude@redhat.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Matthew Brost" <matthew.brost@intel.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
Date: Thu, 22 May 2025 16:59:35 +0200
Message-ID: <a637755cb96de8415b51feb1ae61b8c651e94295.camel@mailbox.org>
In-Reply-To: <b24d5c5e-8a9e-4dfb-886b-b3ad70e62e76@igalia.com>
On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote:
>
> On 22/05/2025 09:27, Philipp Stanner wrote:
> > The drm_gpu_scheduler now supports a callback to help
> > drm_sched_fini()
> > avoid memory leaks. This callback instructs the driver to signal
> > all
> > pending hardware fences.
> >
> > Implement the new callback
> > drm_sched_backend_ops.cancel_pending_fences().
> >
> > Have the callback use drm_mock_sched_job_complete() with a new
> > error
> > field for the fence error.
> >
> > Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there
> > is
> > no party for which checking for a CANCELED status would be useful
> > currently.
> >
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++------------
> >   drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
> >   2 files changed, 25 insertions(+), 46 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > index f999c8859cf7..eca47f0395bc 100644
> > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct drm_mock_sched_entity *entity)
> >  	drm_sched_entity_destroy(&entity->base);
> >  }
> > 
> > -static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> > +static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job, int err)
> >  {
> >  	struct drm_mock_scheduler *sched =
> >  		drm_sched_to_mock_sched(job->base.sched);
> > @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> >  	lockdep_assert_held(&sched->lock);
> > 
> >  	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> > -	list_move_tail(&job->link, &sched->done_list);
> > +	list_del(&job->link);
> > +	if (err)
> > +		dma_fence_set_error(&job->hw_fence, err);
> >  	dma_fence_signal(&job->hw_fence);
> >  	complete(&job->done);
> >  }
> > @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer *hrtimer)
> >  			break;
> > 
> >  		sched->hw_timeline.cur_seqno = job->hw_fence.seqno;
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >  	}
> >  	spin_unlock_irqrestore(&sched->lock, flags);
> > 
> > @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
> > 
> >  static void mock_sched_free_job(struct drm_sched_job *sched_job)
> >  {
> > -	struct drm_mock_scheduler *sched =
> > -			drm_sched_to_mock_sched(sched_job->sched);
> >  	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
> > -	unsigned long flags;
> > 
> > -	/* Remove from the scheduler done list. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_del(&job->link);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> >  	dma_fence_put(&job->hw_fence);
> > -
> >  	drm_sched_job_cleanup(sched_job);
> > 
> >  	/* Mock job itself is freed by the kunit framework. */
> >  }
> > 
> > +static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *gsched)
>
> "gsched" feels like a first time invention. Maybe drm_sched?
Alright
>
> > +{
> > +	struct drm_mock_sched_job *job, *next;
> > +	struct drm_mock_scheduler *sched;
> > +	unsigned long flags;
> > +
> > +	sched = container_of(gsched, struct drm_mock_scheduler, base);
> > +
> > +	spin_lock_irqsave(&sched->lock, flags);
> > +	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > +		drm_mock_sched_job_complete(job, -ECANCELED);
> > +	spin_unlock_irqrestore(&sched->lock, flags);
>
> Canceling of the timers belongs in this call back I think. Otherwise
> jobs are not fully cancelled.
I wouldn't say so – the timers represent things like the hardware
interrupts, and those must be deactivated by the driver itself.

See, one big reason why I like my approach is that the contract between
driver and scheduler is made very simple:

"Driver, signal all fences that you ever returned through run_job() to
this scheduler!"

That always works, and the driver always has all those fences. It's
based on the most fundamental agreement regarding dma_fences: they must
all be signaled.
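
As a purely hypothetical driver-side sketch of that contract (the foo_*
names are invented for illustration; only dma_fence_set_error() and
dma_fence_signal() are the real API), the callback boils down to:

	static void foo_cancel_pending_fences(struct drm_gpu_scheduler *sched)
	{
		struct foo_device *fdev = to_foo_device(sched);
		struct foo_job *job;

		/* Quiesce the hardware first so no IRQ signals concurrently. */
		foo_hw_stop(fdev);

		/* Signal every fence ever returned through run_job(). */
		list_for_each_entry(job, &fdev->pending_jobs, node) {
			dma_fence_set_error(&job->hw_fence, -ECANCELED);
			dma_fence_signal(&job->hw_fence);
		}
	}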
>
> Hm, I also think, conceptually, the order of first canceling the
> timer
> and then signaling the fence should be kept.
That's the case here, no?

The order must indeed be kept, otherwise the timers could fire after
everything is torn down -> use-after-free.
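
Concretely, the order this patch establishes (condensed from the new
drm_mock_sched_fini() in the diff below) is:

	/* 1) Make sure no job timer can fire anymore. */
	list_for_each_entry(job, &sched->job_list, link)
		hrtimer_cancel(&job->timer);

	/* 2) Only then signal the remaining hw fences: drm_sched_fini()
	 *    calls back into mock_sched_cancel_pending_fences().
	 */
	drm_sched_fini(&sched->base);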
>
> At the moment it does not matter hugely, since the timer does not
> signal
> the jobs directly and will not find unlinked jobs, but if that
> changes
> in the future, the reversed order could cause double signaling. So if
> you keep it in the correct logical order that potential gotcha is
> avoided. Basically just keep the two pass approach verbatim, as is in
> the current drm_mock_sched_fini.
>
> The rest of the conversion is I think good.
:)
>
> Only a slight uncertainty after I cross-referenced with my version
> (->cancel_job()) around why I needed to add signaling to
> mock_sched_timedout_job() and manual job cleanup to the timeout test.
> It
> was more than a month ago that I wrote it so can't remember right
> now.
> You checked for memory leaks and the usual stuff?
Hm, it seems I indeed ran into the leak in the timeout tests that you
fixed (in addition to the other stuff) in your RFC.

We should probably fix that in a separate patch.
P.
>
> Regards,
>
> Tvrtko
>
> > +}
> > +
> >  static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
> >  	.run_job = mock_sched_run_job,
> >  	.timedout_job = mock_sched_timedout_job,
> > -	.free_job = mock_sched_free_job
> > +	.free_job = mock_sched_free_job,
> > +	.cancel_pending_fences = mock_sched_cancel_pending_fences,
> >  };
> > 
> >  /**
> > @@ -265,7 +274,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> >  	sched->hw_timeline.context = dma_fence_context_alloc(1);
> >  	atomic_set(&sched->hw_timeline.next_seqno, 0);
> >  	INIT_LIST_HEAD(&sched->job_list);
> > -	INIT_LIST_HEAD(&sched->done_list);
> >  	spin_lock_init(&sched->lock);
> > 
> >  	return sched;
> > @@ -280,38 +288,11 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> >   */
> >  void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> >  {
> > -	struct drm_mock_sched_job *job, *next;
> > -	unsigned long flags;
> > -	LIST_HEAD(list);
> > +	struct drm_mock_sched_job *job;
> > 
> > -	drm_sched_wqueue_stop(&sched->base);
> > -
> > -	/* Force complete all unfinished jobs. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry(job, &list, link)
> > +	list_for_each_entry(job, &sched->job_list, link)
> >  		hrtimer_cancel(&job->timer);
> > 
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		drm_mock_sched_job_complete(job);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	/*
> > -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
> > -	 * free worker.
> > -	 */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->done_list, link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		mock_sched_free_job(&job->base);
> > -
> >  	drm_sched_fini(&sched->base);
> >  }
> > 
> > @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
> >  		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
> >  			break;
> > 
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >  		found++;
> >  	}
> >  unlock:
> > diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > index 27caf8285fb7..22e530d87791 100644
> > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > @@ -32,9 +32,8 @@
> > *
> > * @base: DRM scheduler base class
> > * @test: Backpointer to owning the kunit test case
> > - * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
> > + * @lock: Lock to protect the simulated @hw_timeline and @job_list
> >    * @job_list: List of jobs submitted to the mock GPU
> > - * @done_list: List of jobs completed by the mock GPU
> >    * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
> >    *               @cur_seqno for implementing a struct dma_fence signaling the
> >    *               simulated job completion.
> > @@ -49,7 +48,6 @@ struct drm_mock_scheduler {
> >
> > spinlock_t lock;
> > struct list_head job_list;
> > - struct list_head done_list;
> >
> > struct {
> > u64 context;
>