All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boris Brezillon <boris.brezillon@collabora.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>,
	Sarah Walker <sarah.walker@imgtec.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	Alex Deucher <alexander.deucher@amd.com>
Subject: Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
Date: Tue, 2 May 2023 18:50:06 +0200	[thread overview]
Message-ID: <20230502185006.4b2e67a7@collabora.com> (raw)
In-Reply-To: <5c4f4e89-6126-7701-2023-2628db1b7caa@amd.com>

+Matthew who's been working on the kthread -> wq transition.

On Tue, 2 May 2023 13:36:07 +0200
Christian König <christian.koenig@amd.com> wrote:

> Hi Boris,
> 
> Am 02.05.23 um 13:19 schrieb Boris Brezillon:
> > Hello Christian, Alex,
> >
> > As part of our transition to drm_sched for the powervr GPU driver, we
> > realized drm_sched_resubmit_jobs(), which is used by all drivers
> > relying on drm_sched right except amdgpu, has been deprecated.
> > Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> > drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> > as an alternative.
> >
> > At the very least, for our implementation, we need to restore the
> > drm_sched_job::parent pointers that were set to NULL in
> > drm_sched_stop(), such that jobs submitted before the GPU recovery are
> > considered active when drm_sched_start() is called. That could be done
> > with a custom pending_list iteration restoring drm_sched_job::parent's
> > pointer, but that seems odd to let the scheduler backend manipulate
> > this list directly, and I suspect we need to do other checks, like the
> > karma vs hang-limit thing, so we can flag the entity dirty and cancel
> > all jobs being queued there if the entity has caused too many hangs.
> >
> > Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> > great if you could help us write a piece of documentation describing
> > what should be done between drm_sched_stop() and drm_sched_start(), so
> > new drivers don't come up with their own slightly different/broken
> > version of the same thing.  
> 
> Yeah, really good point! The solution is to not use drm_sched_stop() and 
> drm_sched_start() either.
> 
> The general idea Daniel, the other Intel guys and me seem to have agreed 
> on is to convert the scheduler thread into a work item.
> 
> This work item for pushing jobs to the hw can then be queued to the same 
> workqueue we use for the timeout work item.
> 
> If this workqueue is now configured by your driver as single threaded 
> you have a guarantee that only either the scheduler or the timeout work 
> item is running at the same time. That in turn makes starting/stopping 
> the scheduler for a reset completely superfluous.
> 
> Patches for this has already been floating on the mailing list, but 
> haven't been committed yet. Since this is all WIP.

As I said previously, our work is based on Matthew's patch series
converting drm_sched threads to a workqueue based implementation.
And I think there are a couple of shortcomings with the current
implementation if we do what you suggest (one single-threaded workqueue
used to serialize the GPU resets and job dequeuing/submission):

- drm_sched_main() has a loop dequeuing jobs until it can't (because a
  dep is not signaled, or because all slots are taken), not one at a
  time. There can also be several 1:1 entities waiting for jobs to be
  dequeued before the reset work. That means the reset operation might
  be delayed if we don't have a drm_sched_stop() calling cancel_work()
  and making sure we break out of the loop.
- with a workqueue (and that's even worse with a single-threaded one),
  the entity priorities are not enforced at the de-queuing level. The FW
  will still take care of execution priority, but you can have a low
  prio drm_sched_entity getting pushed a lot of jobs, and with the
  drm_sched_main() loop, you'll only let other entities dequeue their
  jobs when the ringbuf of this low prio entity is full or a job dep
  is not signaled. De-queuing one job at a time makes things better, but
  if you have a lot of 1:1 entities, it can still take sometime until
  you reach the high prio drm_sched worker.

I'm just wondering if you already have solutions to these problems.

Regards,

Boris

      parent reply	other threads:[~2023-05-02 16:50 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02 11:19 drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated Boris Brezillon
2023-05-02 11:36 ` Christian König
2023-05-02 12:41   ` Boris Brezillon
2023-05-03  8:16     ` Boris Brezillon
2023-05-03  8:47       ` Christian König
2023-05-03  9:49         ` Boris Brezillon
2023-05-03 10:28         ` Lucas Stach
2023-05-03 11:40           ` Christian König
2023-05-03 13:10             ` Lucas Stach
2023-05-03 15:01               ` Christian König
2023-05-04  4:54         ` Matthew Brost
2023-05-04 11:07           ` Christian König
2023-05-04 13:07             ` Matthew Brost
2023-05-02 16:50   ` Boris Brezillon [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230502185006.4b2e67a7@collabora.com \
    --to=boris.brezillon@collabora.com \
    --cc=alexander.deucher@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=sarah.walker@imgtec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.