Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Boris Brezillon <boris.brezillon@collabora.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>,
	Sarah Walker <sarah.walker@imgtec.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	Alex Deucher <alexander.deucher@amd.com>
Subject: Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
Date: Wed, 3 May 2023 10:16:24 +0200	[thread overview]
Message-ID: <20230503101624.5dbae57c@collabora.com> (raw)
In-Reply-To: <20230502144132.6a9e1bb5@collabora.com>

Hi Christian,

On Tue, 2 May 2023 14:41:32 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hi Christian,
> 
> Thanks for your quick reply.
> 
> On Tue, 2 May 2023 13:36:07 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
> > Hi Boris,
> > 
> > Am 02.05.23 um 13:19 schrieb Boris Brezillon:  
> > > Hello Christian, Alex,
> > >
> > > As part of our transition to drm_sched for the powervr GPU driver, we
> > > realized drm_sched_resubmit_jobs(), which is used by all drivers
> > > relying on drm_sched right except amdgpu, has been deprecated.
> > > Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> > > drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> > > as an alternative.
> > >
> > > At the very least, for our implementation, we need to restore the
> > > drm_sched_job::parent pointers that were set to NULL in
> > > drm_sched_stop(), such that jobs submitted before the GPU recovery are
> > > considered active when drm_sched_start() is called. That could be done
> > > with a custom pending_list iteration restoring drm_sched_job::parent's
> > > pointer, but that seems odd to let the scheduler backend manipulate
> > > this list directly, and I suspect we need to do other checks, like the
> > > karma vs hang-limit thing, so we can flag the entity dirty and cancel
> > > all jobs being queued there if the entity has caused too many hangs.
> > >
> > > Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> > > great if you could help us write a piece of documentation describing
> > > what should be done between drm_sched_stop() and drm_sched_start(), so
> > > new drivers don't come up with their own slightly different/broken
> > > version of the same thing.    
> > 
> > Yeah, really good point! The solution is to not use drm_sched_stop() and 
> > drm_sched_start() either.  
> 
> Okay. If that's what we're heading to, this should really be clarified
> in the job_timedout() method doc, because right now it's
> mentioning drm_sched_{start,stop,resubmit_jobs}(), with
> drm_sched_resubmit_jobs() being deprecated already.
> 
> > 
> > The general idea Daniel, the other Intel guys and me seem to have agreed 
> > on is to convert the scheduler thread into a work item.
> > 
> > This work item for pushing jobs to the hw can then be queued to the same 
> > workqueue we use for the timeout work item.
> > 
> > If this workqueue is now configured by your driver as single threaded 
> > you have a guarantee that only either the scheduler or the timeout work 
> > item is running at the same time. That in turn makes starting/stopping 
> > the scheduler for a reset completely superfluous.  
> 
> Makes sense.
> 
> > 
> > Patches for this has already been floating on the mailing list, but 
> > haven't been committed yet. Since this is all WIP.  
> 
> Assuming you're talking about [1], yes, I'm aware of this effort
> (PowerVR also has FW-side scheduling, which is what this patch series
> was trying to address initially). And I'm aware of the
> ordered-workqueue trick too, it helped fixing a few races in panfrost
> in the past.
> 
> > 
> > In general it's not really a good idea to change the scheduler and hw 
> > fences during GPU reset/recovery. The dma_fence implementation has a 
> > pretty strict state transition which clearly say that a dma_fence should 
> > never go back from signaled to unsignaled and when you start messing 
> > with that this is exactly what might happen.
> > 
> > What you can do is to save your hw state and re-start at the same 
> > location after handling the timeout.  
> 
> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().

After the discussion I had with Matthew yesterday on IRC, I
realized there was no clear agreement on this. Matthew uses those 3
helpers in the Xe driver right now, and given he intends to use a
multi-threaded wq for its 1:1 schedulers run queue, there's no way he
can get away without calling drm_sched_{start,stop}().
drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
wondering if it wouldn't be preferable to add a ::resubmit_job() method
or extend the ::run_job() one to support the resubmit semantics, which,
AFAIU, is just about enforcing the job done fence (the one returned by
::run_job()) doesn't transition from a signaled to an unsignaled state.

But probably more important than providing a generic helper, we should
document the resubmit semantics (AKA, what should and/or shouldn't be
done with pending jobs when a recovery happens). Because forbidding
people to use a generic helper function doesn't give any guarantee that
they'll do the right thing when coding their own logic, unless we give
clues about what's considered right/wrong, and the current state of the
doc is pretty unclear in this regard.

Regards,

Boris

next prev parent reply	other threads:[~2023-05-03  8:16 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02 11:19 drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated Boris Brezillon
2023-05-02 11:36 ` Christian König
2023-05-02 12:41   ` Boris Brezillon
2023-05-03  8:16     ` Boris Brezillon [this message]
2023-05-03  8:47       ` Christian König
2023-05-03  9:49         ` Boris Brezillon
2023-05-03 10:28         ` Lucas Stach
2023-05-03 11:40           ` Christian König
2023-05-03 13:10             ` Lucas Stach
2023-05-03 15:01               ` Christian König
2023-05-04  4:54         ` Matthew Brost
2023-05-04 11:07           ` Christian König
2023-05-04 13:07             ` Matthew Brost
2023-05-02 16:50   ` Boris Brezillon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230503101624.5dbae57c@collabora.com \
    --to=boris.brezillon@collabora.com \
    --cc=alexander.deucher@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=sarah.walker@imgtec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.