From: Danilo Krummrich <dakr@kernel.org>
To: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: phasta@kernel.org, "Lyude Paul" <lyude@redhat.com>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Matthew Brost" <matthew.brost@intel.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
Date: Thu, 17 Apr 2025 16:48:34 +0200
Message-ID: <aAEUwjzZ9w9xlKRY@cassiopeiae>
In-Reply-To: <83758ca7-8ece-433e-b904-3d21690ead23@igalia.com>
On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
>
> On 17/04/2025 13:11, Danilo Krummrich wrote:
> > On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
> > >
> > > On 17/04/2025 08:45, Philipp Stanner wrote:
> > > > On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
> > >
> > > The problem is exactly that jobs can outlive the entities and the scheduler,
> > > while some userspace may still hold a dma fence reference to the job via a sync
> > > file. This new callback would not solve it for xe, but if everything
> > > required was reference counted it would.
> >
> > I think you're mixing up the job and the dma_fence here; if a job outlives the
> > scheduler, that clearly is a bug and always has been.
> >
> > AFAIK, Xe reference counts its driver-specific job structures *and* the driver-
> > specific scheduler structure, such that drm_sched_fini() won't be called before
> > all jobs have finished.
>
> Yes, sorry, dma fence. But it is not enough to postpone drm_sched_fini() until
> the job has finished. The problem is that an exported dma fence holds a pointer
> to the drm_sched_fence (and so oopses in drm_sched_fence_get_timeline_name() on
> fence->sched->name) *after* the job has finished and the driver was free to tear
> everything down.
Well, that's a bug in drm_sched_fence then, and independent of the other topic.
Once the finished fence in a struct drm_sched_fence has been signaled, it must
be able to live independently of the scheduler.

The lifetime of the drm_sched_fence is entirely independent of the scheduler
itself, as you correctly point out.

Starting to reference count things to keep the whole scheduler etc. alive for as
long as the drm_sched_fence lives is not the correct solution.
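Just to illustrate the direction I mean (rough sketch only, not a proposal; the
cached name field and where exactly it gets filled in are hypothetical): the
fence could carry everything it needs itself instead of dereferencing
fence->sched after it has signaled.

struct drm_sched_fence {
	struct dma_fence	scheduled;
	struct dma_fence	finished;

	/* Hypothetical: the fence's own copy of the timeline name. */
	char			timeline_name[32];

	/* other members unchanged */
};

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	struct drm_sched_fence *fence = to_drm_sched_fence(f);

	/* No fence->sched dereference, so this stays valid after teardown. */
	return fence->timeline_name;
}

And somewhere in the fence init path, while fence->sched is still guaranteed
to be alive:

	strscpy(fence->timeline_name, fence->sched->name,
		sizeof(fence->timeline_name));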
> > Multiple solutions have been discussed already, e.g. just waiting for the pending
> > list to be empty, or reference counting the scheduler for every pending job. Those
> > all had significant downsides, which I don't see with this proposal.
> >
> > I'm all for better ideas though -- what do you propose?
>
> I think we need to brainstorm both issues and see if there is a solution
> which solves them both, with bonus points for being elegant.
The problems are not related. As mentioned above, once signaled, a
drm_sched_fence must not depend on the scheduler any longer.
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index 6b72278c4b72..ae3152beca14 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > > > >  	sched->ready = false;
> > > > >  	kfree(sched->sched_rq);
> > > > >  	sched->sched_rq = NULL;
> > > > > +
> > > > > +	if (!list_empty(&sched->pending_list))
> > > > > +		dev_err(sched->dev, "%s: Tearing down scheduler while jobs are pending!\n",
> > > > > +			__func__);
> > >
> > > It isn't fair to add this error since it would out of the blue start firing
> > > for everyone except nouveau, no? Regardless of whether there is a leak or not.
> >
> > I think it is pretty fair to warn when detecting a guaranteed bug, no?
> >
> > If drm_sched_fini() is called while jobs are still on the pending_list, they won't
> > ever be freed, because all workqueues are stopped.
>
> Is it a guaranteed bug for drivers that are aware of the drm_sched_fini()
> limitation and clean up after themselves?
How could a driver clean up after itself (unless the driver holds its own list of
pending jobs)?
Once a job is in flight (i.e. it's on the pending_list), we must guarantee that
free_job() is called by the scheduler, which it can't do if we call
drm_sched_fini() before the pending_list is empty.
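To make that concrete, the problematic ordering is roughly the following
(illustrative pseudo-sequence only, not taken from any particular driver):

drm_sched_job_arm(job);
drm_sched_entity_push_job(job);
/*
 * The scheduler picks the job, calls ->run_job() and moves it onto
 * sched->pending_list; the hardware is still chewing on it.
 */
drm_sched_fini(sched);	/* stops the run-job and free-job work items */
/* ->free_job(job) is never called anymore, the job's memory is leaked */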
> In other words, if you apply the series up to here, would it trigger for
> nouveau?
No, because nouveau does something very stupid, i.e. it replicates the pending_list.
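Roughly, that duplication looks like the following (illustration only, all
names are made up, this is not nouveau's actual code): the driver keeps its
own list of submitted jobs and frees the leftovers itself before it calls
drm_sched_fini().

struct my_job {
	struct drm_sched_job	base;
	struct list_head	entry;	/* link in my_sched::jobs */
};

struct my_sched {
	struct drm_gpu_scheduler	sched;
	struct list_head		jobs;	/* driver's duplicate of pending_list */
};

static void my_sched_teardown(struct my_sched *s)
{
	struct my_job *job, *tmp;

	/* Drain the driver's private copy of the pending list. */
	list_for_each_entry_safe(job, tmp, &s->jobs, entry) {
		list_del(&job->entry);
		drm_sched_job_cleanup(&job->base);
		kfree(job);
	}

	/* Nothing is left that only ->free_job() could have reclaimed. */
	drm_sched_fini(&s->sched);
}

That's exactly the kind of duplicated bookkeeping this series is supposed to
make unnecessary.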
> Reportedly it triggers for the mock scheduler, which also has no
> leak.
That sounds impossible. How do you ensure you do *not* leak memory when you tear
down the scheduler while it still has pending jobs? Or in other words, who calls
free_job() if not the scheduler itself?
> Also, I asked in my initial reply if we have a list of which of the current
> drivers suffer from memory leaks. Is it all, or just some, etc.?
Not all, but quite a few, I think. The last time I looked (which was about a year
ago), amdgpu, for instance, could leak memory when you unbind the driver while
enough jobs are in flight.