From: Boris Brezillon <boris.brezillon@collabora.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>,
Sarah Walker <sarah.walker@imgtec.com>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
Alex Deucher <alexander.deucher@amd.com>
Subject: Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
Date: Tue, 2 May 2023 18:50:06 +0200 [thread overview]
Message-ID: <20230502185006.4b2e67a7@collabora.com> (raw)
In-Reply-To: <5c4f4e89-6126-7701-2023-2628db1b7caa@amd.com>
+Matthew who's been working on the kthread -> wq transition.
On Tue, 2 May 2023 13:36:07 +0200
Christian König <christian.koenig@amd.com> wrote:
> Hi Boris,
>
> Am 02.05.23 um 13:19 schrieb Boris Brezillon:
> > Hello Christian, Alex,
> >
> > As part of our transition to drm_sched for the powervr GPU driver, we
> > realized drm_sched_resubmit_jobs(), which is used by all drivers
> > relying on drm_sched right except amdgpu, has been deprecated.
> > Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> > drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> > as an alternative.
> >
> > At the very least, for our implementation, we need to restore the
> > drm_sched_job::parent pointers that were set to NULL in
> > drm_sched_stop(), such that jobs submitted before the GPU recovery are
> > considered active when drm_sched_start() is called. That could be done
> > with a custom pending_list iteration restoring drm_sched_job::parent's
> > pointer, but that seems odd to let the scheduler backend manipulate
> > this list directly, and I suspect we need to do other checks, like the
> > karma vs hang-limit thing, so we can flag the entity dirty and cancel
> > all jobs being queued there if the entity has caused too many hangs.
> >
> > Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> > great if you could help us write a piece of documentation describing
> > what should be done between drm_sched_stop() and drm_sched_start(), so
> > new drivers don't come up with their own slightly different/broken
> > version of the same thing.
>
> Yeah, really good point! The solution is to not use drm_sched_stop() and
> drm_sched_start() either.
>
> The general idea Daniel, the other Intel guys and me seem to have agreed
> on is to convert the scheduler thread into a work item.
>
> This work item for pushing jobs to the hw can then be queued to the same
> workqueue we use for the timeout work item.
>
> If this workqueue is now configured by your driver as single threaded
> you have a guarantee that only either the scheduler or the timeout work
> item is running at the same time. That in turn makes starting/stopping
> the scheduler for a reset completely superfluous.
>
> Patches for this has already been floating on the mailing list, but
> haven't been committed yet. Since this is all WIP.
As I said previously, our work is based on Matthew's patch series
converting drm_sched threads to a workqueue based implementation.
And I think there are a couple of shortcomings with the current
implementation if we do what you suggest (one single-threaded workqueue
used to serialize the GPU resets and job dequeuing/submission):
- drm_sched_main() has a loop dequeuing jobs until it can't (because a
dep is not signaled, or because all slots are taken), not one at a
time. There can also be several 1:1 entities waiting for jobs to be
dequeued before the reset work. That means the reset operation might
be delayed if we don't have a drm_sched_stop() calling cancel_work()
and making sure we break out of the loop.
- with a workqueue (and that's even worse with a single-threaded one),
the entity priorities are not enforced at the de-queuing level. The FW
will still take care of execution priority, but you can have a low
prio drm_sched_entity getting pushed a lot of jobs, and with the
drm_sched_main() loop, you'll only let other entities dequeue their
jobs when the ringbuf of this low prio entity is full or a job dep
is not signaled. De-queuing one job at a time makes things better, but
if you have a lot of 1:1 entities, it can still take sometime until
you reach the high prio drm_sched worker.
I'm just wondering if you already have solutions to these problems.
Regards,
Boris
prev parent reply other threads:[~2023-05-02 16:50 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-02 11:19 drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated Boris Brezillon
2023-05-02 11:36 ` Christian König
2023-05-02 12:41 ` Boris Brezillon
2023-05-03 8:16 ` Boris Brezillon
2023-05-03 8:47 ` Christian König
2023-05-03 9:49 ` Boris Brezillon
2023-05-03 10:28 ` Lucas Stach
2023-05-03 11:40 ` Christian König
2023-05-03 13:10 ` Lucas Stach
2023-05-03 15:01 ` Christian König
2023-05-04 4:54 ` Matthew Brost
2023-05-04 11:07 ` Christian König
2023-05-04 13:07 ` Matthew Brost
2023-05-02 16:50 ` Boris Brezillon [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230502185006.4b2e67a7@collabora.com \
--to=boris.brezillon@collabora.com \
--cc=alexander.deucher@amd.com \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=matthew.brost@intel.com \
--cc=sarah.walker@imgtec.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.