drm/sched: Replacement for drm_sched_resubmit

All of lore.kernel.org
 help / color / mirror / Atom feed

* drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
@ 2023-05-02 11:19 Boris Brezillon
  2023-05-02 11:36 ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Boris Brezillon @ 2023-05-02 11:19 UTC (permalink / raw)
  To: Christian König, Alex Deucher
  Cc: Sarah Walker, dri-devel@lists.freedesktop.org

Hello Christian, Alex,

As part of our transition to drm_sched for the powervr GPU driver, we
realized drm_sched_resubmit_jobs(), which is used by all drivers
relying on drm_sched right except amdgpu, has been deprecated.
Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
as an alternative.

At the very least, for our implementation, we need to restore the
drm_sched_job::parent pointers that were set to NULL in
drm_sched_stop(), such that jobs submitted before the GPU recovery are
considered active when drm_sched_start() is called. That could be done
with a custom pending_list iteration restoring drm_sched_job::parent's
pointer, but that seems odd to let the scheduler backend manipulate
this list directly, and I suspect we need to do other checks, like the
karma vs hang-limit thing, so we can flag the entity dirty and cancel
all jobs being queued there if the entity has caused too many hangs.

Now that drm_sched_resubmit_jobs() has been deprecated, that would be
great if you could help us write a piece of documentation describing
what should be done between drm_sched_stop() and drm_sched_start(), so
new drivers don't come up with their own slightly different/broken
version of the same thing.

Thanks in advance for your help.

Regards,

Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-02 11:19 drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated Boris Brezillon
@ 2023-05-02 11:36 ` Christian König
  2023-05-02 12:41   ` Boris Brezillon
  2023-05-02 16:50   ` Boris Brezillon
  0 siblings, 2 replies; 14+ messages in thread
From: Christian König @ 2023-05-02 11:36 UTC (permalink / raw)
  To: Boris Brezillon, Alex Deucher
  Cc: Sarah Walker, dri-devel@lists.freedesktop.org

Hi Boris,

Am 02.05.23 um 13:19 schrieb Boris Brezillon:
> Hello Christian, Alex,
>
> As part of our transition to drm_sched for the powervr GPU driver, we
> realized drm_sched_resubmit_jobs(), which is used by all drivers
> relying on drm_sched right except amdgpu, has been deprecated.
> Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> as an alternative.
>
> At the very least, for our implementation, we need to restore the
> drm_sched_job::parent pointers that were set to NULL in
> drm_sched_stop(), such that jobs submitted before the GPU recovery are
> considered active when drm_sched_start() is called. That could be done
> with a custom pending_list iteration restoring drm_sched_job::parent's
> pointer, but that seems odd to let the scheduler backend manipulate
> this list directly, and I suspect we need to do other checks, like the
> karma vs hang-limit thing, so we can flag the entity dirty and cancel
> all jobs being queued there if the entity has caused too many hangs.
>
> Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> great if you could help us write a piece of documentation describing
> what should be done between drm_sched_stop() and drm_sched_start(), so
> new drivers don't come up with their own slightly different/broken
> version of the same thing.

Yeah, really good point! The solution is to not use drm_sched_stop() and 
drm_sched_start() either.

The general idea Daniel, the other Intel guys and me seem to have agreed 
on is to convert the scheduler thread into a work item.

This work item for pushing jobs to the hw can then be queued to the same 
workqueue we use for the timeout work item.

If this workqueue is now configured by your driver as single threaded 
you have a guarantee that only either the scheduler or the timeout work 
item is running at the same time. That in turn makes starting/stopping 
the scheduler for a reset completely superfluous.

Patches for this has already been floating on the mailing list, but 
haven't been committed yet. Since this is all WIP.

In general it's not really a good idea to change the scheduler and hw 
fences during GPU reset/recovery. The dma_fence implementation has a 
pretty strict state transition which clearly say that a dma_fence should 
never go back from signaled to unsignaled and when you start messing 
with that this is exactly what might happen.

What you can do is to save your hw state and re-start at the same 
location after handling the timeout.

Regards,
Christian.

>
> Thanks in advance for your help.
>
> Regards,
>
> Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-02 11:36 ` Christian König
@ 2023-05-02 12:41   ` Boris Brezillon
  2023-05-03  8:16     ` Boris Brezillon
  2023-05-02 16:50   ` Boris Brezillon
  1 sibling, 1 reply; 14+ messages in thread
From: Boris Brezillon @ 2023-05-02 12:41 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Deucher, Sarah Walker, dri-devel@lists.freedesktop.org

Hi Christian,

Thanks for your quick reply.

On Tue, 2 May 2023 13:36:07 +0200
Christian König <christian.koenig@amd.com> wrote:

> Hi Boris,
> 
> Am 02.05.23 um 13:19 schrieb Boris Brezillon:
> > Hello Christian, Alex,
> >
> > As part of our transition to drm_sched for the powervr GPU driver, we
> > realized drm_sched_resubmit_jobs(), which is used by all drivers
> > relying on drm_sched right except amdgpu, has been deprecated.
> > Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> > drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> > as an alternative.
> >
> > At the very least, for our implementation, we need to restore the
> > drm_sched_job::parent pointers that were set to NULL in
> > drm_sched_stop(), such that jobs submitted before the GPU recovery are
> > considered active when drm_sched_start() is called. That could be done
> > with a custom pending_list iteration restoring drm_sched_job::parent's
> > pointer, but that seems odd to let the scheduler backend manipulate
> > this list directly, and I suspect we need to do other checks, like the
> > karma vs hang-limit thing, so we can flag the entity dirty and cancel
> > all jobs being queued there if the entity has caused too many hangs.
> >
> > Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> > great if you could help us write a piece of documentation describing
> > what should be done between drm_sched_stop() and drm_sched_start(), so
> > new drivers don't come up with their own slightly different/broken
> > version of the same thing.  
> 
> Yeah, really good point! The solution is to not use drm_sched_stop() and 
> drm_sched_start() either.

Okay. If that's what we're heading to, this should really be clarified
in the job_timedout() method doc, because right now it's
mentioning drm_sched_{start,stop,resubmit_jobs}(), with
drm_sched_resubmit_jobs() being deprecated already.

> 
> The general idea Daniel, the other Intel guys and me seem to have agreed 
> on is to convert the scheduler thread into a work item.
> 
> This work item for pushing jobs to the hw can then be queued to the same 
> workqueue we use for the timeout work item.
> 
> If this workqueue is now configured by your driver as single threaded 
> you have a guarantee that only either the scheduler or the timeout work 
> item is running at the same time. That in turn makes starting/stopping 
> the scheduler for a reset completely superfluous.

Makes sense.

> 
> Patches for this has already been floating on the mailing list, but 
> haven't been committed yet. Since this is all WIP.

Assuming you're talking about [1], yes, I'm aware of this effort
(PowerVR also has FW-side scheduling, which is what this patch series
was trying to address initially). And I'm aware of the
ordered-workqueue trick too, it helped fixing a few races in panfrost
in the past.

> 
> In general it's not really a good idea to change the scheduler and hw 
> fences during GPU reset/recovery. The dma_fence implementation has a 
> pretty strict state transition which clearly say that a dma_fence should 
> never go back from signaled to unsignaled and when you start messing 
> with that this is exactly what might happen.
> 
> What you can do is to save your hw state and re-start at the same 
> location after handling the timeout.

To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
We should only make sure the reset operation is a work scheduled on the
ordered-workqueue that's also used by the drm_gpu_scheduler to dequeue
its jobs. What happens in the reset logic is purely driver-specific,
and the drm_gpu_scheduler::pending_list has to be manually iterated by
the driver if it needs to touch in-flight jobs as part of its
reset/resubmit logic. Now remains the question of entity guiltiness,
and guilty entity job's cancellation, which was previously handled by
the core and is now left to drivers. Is there any plan to automate that
a bit, or will it stay driver's responsibility?

Regards,

Boris

[1]https://lwn.net/Articles/928310/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-02 12:41   ` Boris Brezillon
@ 2023-05-03  8:16     ` Boris Brezillon
  2023-05-03  8:47       ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Boris Brezillon @ 2023-05-03  8:16 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Sarah Walker, dri-devel@lists.freedesktop.org,
	Alex Deucher

Hi Christian,

On Tue, 2 May 2023 14:41:32 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hi Christian,
> 
> Thanks for your quick reply.
> 
> On Tue, 2 May 2023 13:36:07 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
> > Hi Boris,
> > 
> > Am 02.05.23 um 13:19 schrieb Boris Brezillon:  
> > > Hello Christian, Alex,
> > >
> > > As part of our transition to drm_sched for the powervr GPU driver, we
> > > realized drm_sched_resubmit_jobs(), which is used by all drivers
> > > relying on drm_sched right except amdgpu, has been deprecated.
> > > Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> > > drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> > > as an alternative.
> > >
> > > At the very least, for our implementation, we need to restore the
> > > drm_sched_job::parent pointers that were set to NULL in
> > > drm_sched_stop(), such that jobs submitted before the GPU recovery are
> > > considered active when drm_sched_start() is called. That could be done
> > > with a custom pending_list iteration restoring drm_sched_job::parent's
> > > pointer, but that seems odd to let the scheduler backend manipulate
> > > this list directly, and I suspect we need to do other checks, like the
> > > karma vs hang-limit thing, so we can flag the entity dirty and cancel
> > > all jobs being queued there if the entity has caused too many hangs.
> > >
> > > Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> > > great if you could help us write a piece of documentation describing
> > > what should be done between drm_sched_stop() and drm_sched_start(), so
> > > new drivers don't come up with their own slightly different/broken
> > > version of the same thing.    
> > 
> > Yeah, really good point! The solution is to not use drm_sched_stop() and 
> > drm_sched_start() either.  
> 
> Okay. If that's what we're heading to, this should really be clarified
> in the job_timedout() method doc, because right now it's
> mentioning drm_sched_{start,stop,resubmit_jobs}(), with
> drm_sched_resubmit_jobs() being deprecated already.
> 
> > 
> > The general idea Daniel, the other Intel guys and me seem to have agreed 
> > on is to convert the scheduler thread into a work item.
> > 
> > This work item for pushing jobs to the hw can then be queued to the same 
> > workqueue we use for the timeout work item.
> > 
> > If this workqueue is now configured by your driver as single threaded 
> > you have a guarantee that only either the scheduler or the timeout work 
> > item is running at the same time. That in turn makes starting/stopping 
> > the scheduler for a reset completely superfluous.  
> 
> Makes sense.
> 
> > 
> > Patches for this has already been floating on the mailing list, but 
> > haven't been committed yet. Since this is all WIP.  
> 
> Assuming you're talking about [1], yes, I'm aware of this effort
> (PowerVR also has FW-side scheduling, which is what this patch series
> was trying to address initially). And I'm aware of the
> ordered-workqueue trick too, it helped fixing a few races in panfrost
> in the past.
> 
> > 
> > In general it's not really a good idea to change the scheduler and hw 
> > fences during GPU reset/recovery. The dma_fence implementation has a 
> > pretty strict state transition which clearly say that a dma_fence should 
> > never go back from signaled to unsignaled and when you start messing 
> > with that this is exactly what might happen.
> > 
> > What you can do is to save your hw state and re-start at the same 
> > location after handling the timeout.  
> 
> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().

After the discussion I had with Matthew yesterday on IRC, I
realized there was no clear agreement on this. Matthew uses those 3
helpers in the Xe driver right now, and given he intends to use a
multi-threaded wq for its 1:1 schedulers run queue, there's no way he
can get away without calling drm_sched_{start,stop}().
drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
wondering if it wouldn't be preferable to add a ::resubmit_job() method
or extend the ::run_job() one to support the resubmit semantics, which,
AFAIU, is just about enforcing the job done fence (the one returned by
::run_job()) doesn't transition from a signaled to an unsignaled state.

But probably more important than providing a generic helper, we should
document the resubmit semantics (AKA, what should and/or shouldn't be
done with pending jobs when a recovery happens). Because forbidding
people to use a generic helper function doesn't give any guarantee that
they'll do the right thing when coding their own logic, unless we give
clues about what's considered right/wrong, and the current state of the
doc is pretty unclear in this regard.

Regards,

Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03  8:16     ` Boris Brezillon
@ 2023-05-03  8:47       ` Christian König
  2023-05-03  9:49         ` Boris Brezillon
                           ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christian König @ 2023-05-03  8:47 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, Sarah Walker, dri-devel@lists.freedesktop.org,
	Tuikov, Luben, Alex Deucher

Adding Luben as well.

Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> [SNIP]
>> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> After the discussion I had with Matthew yesterday on IRC, I
> realized there was no clear agreement on this. Matthew uses those 3
> helpers in the Xe driver right now, and given he intends to use a
> multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> can get away without calling drm_sched_{start,stop}().
> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> wondering if it wouldn't be preferable to add a ::resubmit_job() method
> or extend the ::run_job() one to support the resubmit semantics, which,
> AFAIU, is just about enforcing the job done fence (the one returned by
> ::run_job()) doesn't transition from a signaled to an unsignaled state.
>
> But probably more important than providing a generic helper, we should
> document the resubmit semantics (AKA, what should and/or shouldn't be
> done with pending jobs when a recovery happens). Because forbidding
> people to use a generic helper function doesn't give any guarantee that
> they'll do the right thing when coding their own logic, unless we give
> clues about what's considered right/wrong, and the current state of the
> doc is pretty unclear in this regard.

I should probably talk about the history of the re-submit feature a bit 
more.

Basically AMD came up with re-submission as a cheap way of increasing 
the reliability of GPU resets. Problem is that it turned into an 
absolutely nightmare. We tried for the last 5 years or so to get that 
stable and it's still crashing.

The first and most major problem is that the kernel doesn't even has the 
information if re-submitting jobs is possible or not. For example a job 
which has already been pushed to the hw could have grabbed a binary 
semaphore and re-submitting it will just wait forever for the semaphore 
to be released.

The second problem is that the dma_fence semantics don't allow to ever 
transit the state of a fence from signaled back to unsignaled. This 
means that you can't re-use the hw fence and need to allocate a new one, 
but since memory allocation is forbidden inside a reset handler as well 
(YES we need to better document that part) you actually need to keep a 
bunch of hw fences pre-allocated around to make this work. Amdgpu choose 
to illegally re-use the hw fence instead which only works with quite 
extreme hacks.

The third problem is that the lifetime of the job object was actually 
defined very well before we tried to use re-submission. Basically it's 
just an intermediate state used between the IOCTL and pushing things to 
the hw, introducing this re-submit feature completely messed that up and 
cause quite a number of use after free errors in the past which are 
again only solved by quite some hacks.

What we should do in the GPU scheduler instead is the follow:

1. Don't support re-submission at all!
     Instead we can provide help to drivers to query which fences 
(scheduler or hw) are still not signaled yet.
     This can then be used to re-create hw state if (and only if!) the 
driver knows what it's doing and can actually guarantee that this will work.
     E.g. the case for XE where the kernel driver knows the contexts 
which were not running at the time and can re-create their queues.

2. We can provide both a wq to use for single threaded application as 
well as start/stop semantics.
     It's just that the start/stop semantics should never touch what was 
already submitted, but rather just make sure that we don't get any new 
submissions.

Regards,
Christian.

>
> Regards,
>
> Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03  8:47       ` Christian König
@ 2023-05-03  9:49         ` Boris Brezillon
  2023-05-03 10:28         ` Lucas Stach
  2023-05-04  4:54         ` Matthew Brost
  2 siblings, 0 replies; 14+ messages in thread
From: Boris Brezillon @ 2023-05-03  9:49 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Sarah Walker, dri-devel@lists.freedesktop.org,
	Tuikov, Luben, Alex Deucher

On Wed, 3 May 2023 10:47:43 +0200
Christian König <christian.koenig@amd.com> wrote:

> Adding Luben as well.
> 
> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > [SNIP]  
> >> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().  
> > After the discussion I had with Matthew yesterday on IRC, I
> > realized there was no clear agreement on this. Matthew uses those 3
> > helpers in the Xe driver right now, and given he intends to use a
> > multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> > can get away without calling drm_sched_{start,stop}().
> > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > or extend the ::run_job() one to support the resubmit semantics, which,
> > AFAIU, is just about enforcing the job done fence (the one returned by
> > ::run_job()) doesn't transition from a signaled to an unsignaled state.
> >
> > But probably more important than providing a generic helper, we should
> > document the resubmit semantics (AKA, what should and/or shouldn't be
> > done with pending jobs when a recovery happens). Because forbidding
> > people to use a generic helper function doesn't give any guarantee that
> > they'll do the right thing when coding their own logic, unless we give
> > clues about what's considered right/wrong, and the current state of the
> > doc is pretty unclear in this regard.  
> 
> I should probably talk about the history of the re-submit feature a bit 
> more.
> 
> Basically AMD came up with re-submission as a cheap way of increasing 
> the reliability of GPU resets. Problem is that it turned into an 
> absolutely nightmare. We tried for the last 5 years or so to get that 
> stable and it's still crashing.
> 
> The first and most major problem is that the kernel doesn't even has the 
> information if re-submitting jobs is possible or not. For example a job 
> which has already been pushed to the hw could have grabbed a binary 
> semaphore and re-submitting it will just wait forever for the semaphore 
> to be released.

That's a valid point. Definitely something worth pointing out in the
doc IMHO.

> 
> The second problem is that the dma_fence semantics don't allow to ever 
> transit the state of a fence from signaled back to unsignaled. This 
> means that you can't re-use the hw fence and need to allocate a new one, 
> but since memory allocation is forbidden inside a reset handler as well 
> (YES we need to better document that part) you actually need to keep a 
> bunch of hw fences pre-allocated around to make this work. Amdgpu choose 
> to illegally re-use the hw fence instead which only works with quite 
> extreme hacks.

Hm, maybe I'm missing something, but I don't really see why we'd ever go
from signaled to unsignaled in the first place. If the parent fence is
signaled by the time we reach the resubmit function (because the GPU
keeps executing stuff after we called drm_sched_stop() and before we
actually shut it down), the ::run_job() hook would just return a
signaled fence at re-submission time, and the actual re-submission can
be skipped. If the job hasn't finished before the reset, the fence
should still be unsignaled, so no unsignaled -> signaled state
transition AFAICT.

> 
> The third problem is that the lifetime of the job object was actually 
> defined very well before we tried to use re-submission. Basically it's 
> just an intermediate state used between the IOCTL and pushing things to 
> the hw, introducing this re-submit feature completely messed that up and 
> cause quite a number of use after free errors in the past which are 
> again only solved by quite some hacks.

It's not clear to me what the resubmit logic has to do with that,
but I trust you.

> 
> What we should do in the GPU scheduler instead is the follow:
> 
> 1. Don't support re-submission at all!
>      Instead we can provide help to drivers to query which fences 
> (scheduler or hw) are still not signaled yet.
>      This can then be used to re-create hw state if (and only if!) the 
> driver knows what it's doing and can actually guarantee that this will work.
>      E.g. the case for XE where the kernel driver knows the contexts 
> which were not running at the time and can re-create their queues.

Okay, that basically means any queue that had in-flight jobs at
recovery time becomes unusable after that point, because skipping one
job might put you in an inconsistent state.

> 
> 2. We can provide both a wq to use for single threaded application as 
> well as start/stop semantics.

There's no risk calling stop+start even in the single-threaded wq
case, and the overhead should be negligible. If we want to keep things
simple, let's just say drivers should always call those :-).

>      It's just that the start/stop semantics should never touch what was 
> already submitted, but rather just make sure that we don't get any new 
> submissions.

So, something like that:

1. Call drm_sched_stop()
2. Iterate over all pending jobs, and flag the entity as dirty and/or
   kill it if any of the jobs were unfinished when the GPU was reset
3. Call drm_sched_start()

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03  8:47       ` Christian König
  2023-05-03  9:49         ` Boris Brezillon
@ 2023-05-03 10:28         ` Lucas Stach
  2023-05-03 11:40           ` Christian König
  2023-05-04  4:54         ` Matthew Brost
  2 siblings, 1 reply; 14+ messages in thread
From: Lucas Stach @ 2023-05-03 10:28 UTC (permalink / raw)
  To: Christian König, Boris Brezillon
  Cc: Matthew Brost, Alex Deucher, Tuikov, Luben, Sarah Walker,
	dri-devel@lists.freedesktop.org

Hi Christian,

Am Mittwoch, dem 03.05.2023 um 10:47 +0200 schrieb Christian König:
> Adding Luben as well.
> 
> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > [SNIP]
> > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > After the discussion I had with Matthew yesterday on IRC, I
> > realized there was no clear agreement on this. Matthew uses those 3
> > helpers in the Xe driver right now, and given he intends to use a
> > multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> > can get away without calling drm_sched_{start,stop}().
> > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > or extend the ::run_job() one to support the resubmit semantics, which,
> > AFAIU, is just about enforcing the job done fence (the one returned by
> > ::run_job()) doesn't transition from a signaled to an unsignaled state.
> > 
> > But probably more important than providing a generic helper, we should
> > document the resubmit semantics (AKA, what should and/or shouldn't be
> > done with pending jobs when a recovery happens). Because forbidding
> > people to use a generic helper function doesn't give any guarantee that
> > they'll do the right thing when coding their own logic, unless we give
> > clues about what's considered right/wrong, and the current state of the
> > doc is pretty unclear in this regard.
> 
> I should probably talk about the history of the re-submit feature a bit 
> more.
> 
> Basically AMD came up with re-submission as a cheap way of increasing 
> the reliability of GPU resets. Problem is that it turned into an 
> absolutely nightmare. We tried for the last 5 years or so to get that 
> stable and it's still crashing.
> 
> The first and most major problem is that the kernel doesn't even has the 
> information if re-submitting jobs is possible or not. For example a job 
> which has already been pushed to the hw could have grabbed a binary 
> semaphore and re-submitting it will just wait forever for the semaphore 
> to be released.
> 
I can follow this argument, but concluding that job resubmission is
impossible is punishing simple GPUs. On Vivante GPUs we have exactly
one job running at a time and all dependencies are visible to the
scheduler, as we don't have/use any hardware synchronization mechanism,
so all synchronization is piped through kernel visible fences.

It's reasonably easy for the etnaviv driver to find the guilty job to
skip but resubmit all other jobs in the current hardware queue. I'm not
really fond of having to make all applications deal with innocent
context resets, while we can solve this via resubmission on simple HW. 

I know that more complex hardware and use-cases might still require the
kernel driver for this HW to give up and shoot all contexts active at
the time of the GPU reset, but that's the price you pay for the
hardware being more capable. I don't see why we should also pay that
price on really simple HW.

> The second problem is that the dma_fence semantics don't allow to ever 
> transit the state of a fence from signaled back to unsignaled. This 
> means that you can't re-use the hw fence and need to allocate a new one, 
> but since memory allocation is forbidden inside a reset handler as well 
> (YES we need to better document that part) you actually need to keep a 
> bunch of hw fences pre-allocated around to make this work. Amdgpu choose 
> to illegally re-use the hw fence instead which only works with quite 
> extreme hacks.
> 
I'm with Boris here. Could you please explain when a fence would be
already signaled in a GPU reset scenario and would need to go back to
unsignaled, so we are on the same page here?

Also the no memory allocation policy really needs some documentation.
Currently etnaviv may allocate a bunch of memory to create the
devcoredump before resetting the GPU and I don't believe etnaviv is the
only driver doing such a thing. The allocations are set up in a way to
just give up if there is memory pressure, as getting the GPU back in
working order is obviously more important than writing GPU state data
for post mortem analysis, but I also don't see where else this
allocation could be done.

> The third problem is that the lifetime of the job object was actually 
> defined very well before we tried to use re-submission. Basically it's 
> just an intermediate state used between the IOCTL and pushing things to 
> the hw, introducing this re-submit feature completely messed that up and 
> cause quite a number of use after free errors in the past which are 
> again only solved by quite some hacks.
> 
Isn't it still well-defined? The job object lives until it's parent
fence signaled. Pulling a unsignaled parent fence and attaching a new
one is not a problem.

Regards,
Lucas

> What we should do in the GPU scheduler instead is the follow:
> 
> 1. Don't support re-submission at all!
>      Instead we can provide help to drivers to query which fences 
> (scheduler or hw) are still not signaled yet.
>      This can then be used to re-create hw state if (and only if!) the 
> driver knows what it's doing and can actually guarantee that this will work.
>      E.g. the case for XE where the kernel driver knows the contexts 
> which were not running at the time and can re-create their queues.
> 
> 2. We can provide both a wq to use for single threaded application as 
> well as start/stop semantics.
>      It's just that the start/stop semantics should never touch what was 
> already submitted, but rather just make sure that we don't get any new 
> submissions.
> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> > 
> > Boris
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03 10:28         ` Lucas Stach
@ 2023-05-03 11:40           ` Christian König
  2023-05-03 13:10             ` Lucas Stach
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2023-05-03 11:40 UTC (permalink / raw)
  To: Lucas Stach, Boris Brezillon
  Cc: Matthew Brost, Alex Deucher, Tuikov, Luben, Sarah Walker,
	dri-devel@lists.freedesktop.org

Hi Lucas,

Am 03.05.23 um 12:28 schrieb Lucas Stach:
> Hi Christian,
>
> Am Mittwoch, dem 03.05.2023 um 10:47 +0200 schrieb Christian König:
>> Adding Luben as well.
>>
>> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
>>> [SNIP]
>>>> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
>>> After the discussion I had with Matthew yesterday on IRC, I
>>> realized there was no clear agreement on this. Matthew uses those 3
>>> helpers in the Xe driver right now, and given he intends to use a
>>> multi-threaded wq for its 1:1 schedulers run queue, there's no way he
>>> can get away without calling drm_sched_{start,stop}().
>>> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
>>> wondering if it wouldn't be preferable to add a ::resubmit_job() method
>>> or extend the ::run_job() one to support the resubmit semantics, which,
>>> AFAIU, is just about enforcing the job done fence (the one returned by
>>> ::run_job()) doesn't transition from a signaled to an unsignaled state.
>>>
>>> But probably more important than providing a generic helper, we should
>>> document the resubmit semantics (AKA, what should and/or shouldn't be
>>> done with pending jobs when a recovery happens). Because forbidding
>>> people to use a generic helper function doesn't give any guarantee that
>>> they'll do the right thing when coding their own logic, unless we give
>>> clues about what's considered right/wrong, and the current state of the
>>> doc is pretty unclear in this regard.
>> I should probably talk about the history of the re-submit feature a bit
>> more.
>>
>> Basically AMD came up with re-submission as a cheap way of increasing
>> the reliability of GPU resets. Problem is that it turned into an
>> absolutely nightmare. We tried for the last 5 years or so to get that
>> stable and it's still crashing.
>>
>> The first and most major problem is that the kernel doesn't even has the
>> information if re-submitting jobs is possible or not. For example a job
>> which has already been pushed to the hw could have grabbed a binary
>> semaphore and re-submitting it will just wait forever for the semaphore
>> to be released.
>>
> I can follow this argument, but concluding that job resubmission is
> impossible is punishing simple GPUs. On Vivante GPUs we have exactly
> one job running at a time and all dependencies are visible to the
> scheduler, as we don't have/use any hardware synchronization mechanism,
> so all synchronization is piped through kernel visible fences.
>
> It's reasonably easy for the etnaviv driver to find the guilty job to
> skip but resubmit all other jobs in the current hardware queue. I'm not
> really fond of having to make all applications deal with innocent
> context resets, while we can solve this via resubmission on simple HW.
>
> I know that more complex hardware and use-cases might still require the
> kernel driver for this HW to give up and shoot all contexts active at
> the time of the GPU reset, but that's the price you pay for the
> hardware being more capable. I don't see why we should also pay that
> price on really simple HW.

You can still re-create the hw state inside your driver to continue work 
from some point if know that this will work.

As I wrote below the scheduler component can even provide help with with 
that in the form of providing all the unsignaled hw or scheduler fences 
for example.

But what we absolutely should *not* do is to have this re-submission 
feature, because that requires re-using the dma_fence objects. In other 
words this dance with detaching the scheduler fence from the hw fence 
and attach a new one is what absolutely doesn't work.

>> The second problem is that the dma_fence semantics don't allow to ever
>> transit the state of a fence from signaled back to unsignaled. This
>> means that you can't re-use the hw fence and need to allocate a new one,
>> but since memory allocation is forbidden inside a reset handler as well
>> (YES we need to better document that part) you actually need to keep a
>> bunch of hw fences pre-allocated around to make this work. Amdgpu choose
>> to illegally re-use the hw fence instead which only works with quite
>> extreme hacks.
>>
> I'm with Boris here. Could you please explain when a fence would be
> already signaled in a GPU reset scenario and would need to go back to
> unsignaled, so we are on the same page here?

Take a look at how this re-submission feature of the scheduler works. 
The approach is basically that you detach the hw fence from the 
scheduler fence and then attach a new one.

Either you re-use the hw fence, violate the dma_fence semantic or make 
sure you can always allocate a new hw fence object. Basically you can 
only choose between evils and all of them suck.

> Also the no memory allocation policy really needs some documentation.
> Currently etnaviv may allocate a bunch of memory to create the
> devcoredump before resetting the GPU and I don't believe etnaviv is the
> only driver doing such a thing. The allocations are set up in a way to
> just give up if there is memory pressure, as getting the GPU back in
> working order is obviously more important than writing GPU state data
> for post mortem analysis, but I also don't see where else this
> allocation could be done.

Well doing some optional memory allocation for devcoredump for post 
mortem analysis is perfectly fine. If that fails you can just write into 
syslog "sorry can't dump because of lack of memory".

But what we can't do is to rely on memory allocation for proper 
operation, e.g. something like "I can't get the desktop working again 
because memory allocation in the GPU reset handler failed" is a no-go.

>> The third problem is that the lifetime of the job object was actually
>> defined very well before we tried to use re-submission. Basically it's
>> just an intermediate state used between the IOCTL and pushing things to
>> the hw, introducing this re-submit feature completely messed that up and
>> cause quite a number of use after free errors in the past which are
>> again only solved by quite some hacks.
>>
> Isn't it still well-defined? The job object lives until it's parent
> fence signaled. Pulling a unsignaled parent fence and attaching a new
> one is not a problem.

Yeah, problem again is which parent fence? The old one, the new one? How 
to prevent racing between fence signaling and job timeout? Can we 
cleanup the job from the interrupt or atomic context? (Turned out no, we 
can't).

The idea of the job object was to give the scheduler something to work 
with, e.g. you push it into the scheduler on the one end and get it back 
on the other end based on scheduling properties, dependencies etc... 
Initially that was even just a pointer to void.

The hw fence then represents what's currently running on the hw, with 
timeout, resource allocation, etc...

If necessary the driver can even use the same object for both (like 
amdgpu does now), but the key point is that this is nothing the 
scheduler should be dealing with.

Regards,
Christian.

>
> Regards,
> Lucas
>
>> What we should do in the GPU scheduler instead is the follow:
>>
>> 1. Don't support re-submission at all!
>>       Instead we can provide help to drivers to query which fences
>> (scheduler or hw) are still not signaled yet.
>>       This can then be used to re-create hw state if (and only if!) the
>> driver knows what it's doing and can actually guarantee that this will work.
>>       E.g. the case for XE where the kernel driver knows the contexts
>> which were not running at the time and can re-create their queues.
>>
>> 2. We can provide both a wq to use for single threaded application as
>> well as start/stop semantics.
>>       It's just that the start/stop semantics should never touch what was
>> already submitted, but rather just make sure that we don't get any new
>> submissions.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>
>>> Boris


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03 11:40           ` Christian König
@ 2023-05-03 13:10             ` Lucas Stach
  2023-05-03 15:01               ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: Lucas Stach @ 2023-05-03 13:10 UTC (permalink / raw)
  To: Christian König, Boris Brezillon
  Cc: Matthew Brost, Alex Deucher, Tuikov, Luben, Sarah Walker,
	dri-devel@lists.freedesktop.org

Am Mittwoch, dem 03.05.2023 um 13:40 +0200 schrieb Christian König:
> Hi Lucas,
> 
> Am 03.05.23 um 12:28 schrieb Lucas Stach:
> > Hi Christian,
> > 
> > Am Mittwoch, dem 03.05.2023 um 10:47 +0200 schrieb Christian König:
> > > Adding Luben as well.
> > > 
> > > Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > > > [SNIP]
> > > > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > > > After the discussion I had with Matthew yesterday on IRC, I
> > > > realized there was no clear agreement on this. Matthew uses those 3
> > > > helpers in the Xe driver right now, and given he intends to use a
> > > > multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> > > > can get away without calling drm_sched_{start,stop}().
> > > > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > > > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > > > or extend the ::run_job() one to support the resubmit semantics, which,
> > > > AFAIU, is just about enforcing the job done fence (the one returned by
> > > > ::run_job()) doesn't transition from a signaled to an unsignaled state.
> > > > 
> > > > But probably more important than providing a generic helper, we should
> > > > document the resubmit semantics (AKA, what should and/or shouldn't be
> > > > done with pending jobs when a recovery happens). Because forbidding
> > > > people to use a generic helper function doesn't give any guarantee that
> > > > they'll do the right thing when coding their own logic, unless we give
> > > > clues about what's considered right/wrong, and the current state of the
> > > > doc is pretty unclear in this regard.
> > > I should probably talk about the history of the re-submit feature a bit
> > > more.
> > > 
> > > Basically AMD came up with re-submission as a cheap way of increasing
> > > the reliability of GPU resets. Problem is that it turned into an
> > > absolutely nightmare. We tried for the last 5 years or so to get that
> > > stable and it's still crashing.
> > > 
> > > The first and most major problem is that the kernel doesn't even has the
> > > information if re-submitting jobs is possible or not. For example a job
> > > which has already been pushed to the hw could have grabbed a binary
> > > semaphore and re-submitting it will just wait forever for the semaphore
> > > to be released.
> > > 
> > I can follow this argument, but concluding that job resubmission is
> > impossible is punishing simple GPUs. On Vivante GPUs we have exactly
> > one job running at a time and all dependencies are visible to the
> > scheduler, as we don't have/use any hardware synchronization mechanism,
> > so all synchronization is piped through kernel visible fences.
> > 
> > It's reasonably easy for the etnaviv driver to find the guilty job to
> > skip but resubmit all other jobs in the current hardware queue. I'm not
> > really fond of having to make all applications deal with innocent
> > context resets, while we can solve this via resubmission on simple HW.
> > 
> > I know that more complex hardware and use-cases might still require the
> > kernel driver for this HW to give up and shoot all contexts active at
> > the time of the GPU reset, but that's the price you pay for the
> > hardware being more capable. I don't see why we should also pay that
> > price on really simple HW.
> 
> You can still re-create the hw state inside your driver to continue work 
> from some point if know that this will work.
> 
> As I wrote below the scheduler component can even provide help with with 
> that in the form of providing all the unsignaled hw or scheduler fences 
> for example.
> 
> But what we absolutely should *not* do is to have this re-submission 
> feature, because that requires re-using the dma_fence objects. In other 
> words this dance with detaching the scheduler fence from the hw fence 
> and attach a new one is what absolutely doesn't work.
> 
> > > The second problem is that the dma_fence semantics don't allow to ever
> > > transit the state of a fence from signaled back to unsignaled. This
> > > means that you can't re-use the hw fence and need to allocate a new one,
> > > but since memory allocation is forbidden inside a reset handler as well
> > > (YES we need to better document that part) you actually need to keep a
> > > bunch of hw fences pre-allocated around to make this work. Amdgpu choose
> > > to illegally re-use the hw fence instead which only works with quite
> > > extreme hacks.
> > > 
> > I'm with Boris here. Could you please explain when a fence would be
> > already signaled in a GPU reset scenario and would need to go back to
> > unsignaled, so we are on the same page here?
> 
> Take a look at how this re-submission feature of the scheduler works. 
> The approach is basically that you detach the hw fence from the 
> scheduler fence and then attach a new one.
> 
Right, but this shouldn't be a problem, as long as the old fence isn't
signaled yet, right? It becomes a problem when the GPU reset and fence
signaling are racing each other, due to insufficient hardware/software
state synchronization.

I'm sure that the necessary synchronization can be hard to get right,
but it's not the act of switching one unsignaled fence to a new one or
reusing the old unsignaled fence that's causing problems, but the
complications of making sure that old fences don't signal after the
timeout detection, right?

To be clear: I'm not asking those questions because I think I know any
better, but because I'm actually not sure and would like to understand
your line of thought and background information when you say "this
dance with detaching the scheduler fence from the hw fence and attach a
new one is what absolutely doesn't work" above.

Driver writers need to understand the limitations of the current
resubmission scheduler code to do better in their driver
implementation, otherwise we just end up with (worse) open-coded
variations of that code in each driver.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03 13:10             ` Lucas Stach
@ 2023-05-03 15:01               ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2023-05-03 15:01 UTC (permalink / raw)
  To: Lucas Stach, Boris Brezillon
  Cc: Matthew Brost, Alex Deucher, Tuikov, Luben, Sarah Walker,
	dri-devel@lists.freedesktop.org

Am 03.05.23 um 15:10 schrieb Lucas Stach:
> Am Mittwoch, dem 03.05.2023 um 13:40 +0200 schrieb Christian König:
>> Hi Lucas,
>>
>> Am 03.05.23 um 12:28 schrieb Lucas Stach:
>>> Hi Christian,
>>>
>>> Am Mittwoch, dem 03.05.2023 um 10:47 +0200 schrieb Christian König:
>>>> Adding Luben as well.
>>>>
>>>> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
>>>>> [SNIP]
>>>>>> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
>>>>> After the discussion I had with Matthew yesterday on IRC, I
>>>>> realized there was no clear agreement on this. Matthew uses those 3
>>>>> helpers in the Xe driver right now, and given he intends to use a
>>>>> multi-threaded wq for its 1:1 schedulers run queue, there's no way he
>>>>> can get away without calling drm_sched_{start,stop}().
>>>>> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
>>>>> wondering if it wouldn't be preferable to add a ::resubmit_job() method
>>>>> or extend the ::run_job() one to support the resubmit semantics, which,
>>>>> AFAIU, is just about enforcing the job done fence (the one returned by
>>>>> ::run_job()) doesn't transition from a signaled to an unsignaled state.
>>>>>
>>>>> But probably more important than providing a generic helper, we should
>>>>> document the resubmit semantics (AKA, what should and/or shouldn't be
>>>>> done with pending jobs when a recovery happens). Because forbidding
>>>>> people to use a generic helper function doesn't give any guarantee that
>>>>> they'll do the right thing when coding their own logic, unless we give
>>>>> clues about what's considered right/wrong, and the current state of the
>>>>> doc is pretty unclear in this regard.
>>>> I should probably talk about the history of the re-submit feature a bit
>>>> more.
>>>>
>>>> Basically AMD came up with re-submission as a cheap way of increasing
>>>> the reliability of GPU resets. Problem is that it turned into an
>>>> absolutely nightmare. We tried for the last 5 years or so to get that
>>>> stable and it's still crashing.
>>>>
>>>> The first and most major problem is that the kernel doesn't even has the
>>>> information if re-submitting jobs is possible or not. For example a job
>>>> which has already been pushed to the hw could have grabbed a binary
>>>> semaphore and re-submitting it will just wait forever for the semaphore
>>>> to be released.
>>>>
>>> I can follow this argument, but concluding that job resubmission is
>>> impossible is punishing simple GPUs. On Vivante GPUs we have exactly
>>> one job running at a time and all dependencies are visible to the
>>> scheduler, as we don't have/use any hardware synchronization mechanism,
>>> so all synchronization is piped through kernel visible fences.
>>>
>>> It's reasonably easy for the etnaviv driver to find the guilty job to
>>> skip but resubmit all other jobs in the current hardware queue. I'm not
>>> really fond of having to make all applications deal with innocent
>>> context resets, while we can solve this via resubmission on simple HW.
>>>
>>> I know that more complex hardware and use-cases might still require the
>>> kernel driver for this HW to give up and shoot all contexts active at
>>> the time of the GPU reset, but that's the price you pay for the
>>> hardware being more capable. I don't see why we should also pay that
>>> price on really simple HW.
>> You can still re-create the hw state inside your driver to continue work
>> from some point if know that this will work.
>>
>> As I wrote below the scheduler component can even provide help with with
>> that in the form of providing all the unsignaled hw or scheduler fences
>> for example.
>>
>> But what we absolutely should *not* do is to have this re-submission
>> feature, because that requires re-using the dma_fence objects. In other
>> words this dance with detaching the scheduler fence from the hw fence
>> and attach a new one is what absolutely doesn't work.
>>
>>>> The second problem is that the dma_fence semantics don't allow to ever
>>>> transit the state of a fence from signaled back to unsignaled. This
>>>> means that you can't re-use the hw fence and need to allocate a new one,
>>>> but since memory allocation is forbidden inside a reset handler as well
>>>> (YES we need to better document that part) you actually need to keep a
>>>> bunch of hw fences pre-allocated around to make this work. Amdgpu choose
>>>> to illegally re-use the hw fence instead which only works with quite
>>>> extreme hacks.
>>>>
>>> I'm with Boris here. Could you please explain when a fence would be
>>> already signaled in a GPU reset scenario and would need to go back to
>>> unsignaled, so we are on the same page here?
>> Take a look at how this re-submission feature of the scheduler works.
>> The approach is basically that you detach the hw fence from the
>> scheduler fence and then attach a new one.
>>
> Right, but this shouldn't be a problem, as long as the old fence isn't
> signaled yet, right? It becomes a problem when the GPU reset and fence
> signaling are racing each other, due to insufficient hardware/software
> state synchronization.

Exactly that.

> I'm sure that the necessary synchronization can be hard to get right,
> but it's not the act of switching one unsignaled fence to a new one or
> reusing the old unsignaled fence that's causing problems, but the
> complications of making sure that old fences don't signal after the
> timeout detection, right?

Well sort of, switching the fences is the root of the problem.

Basically the initial hw fence for a submission should never signal 
until the hw is done with that submission. Either completed it or 
aborted it.

The concept that the GPU reset aborts the existing fences then re-submit 
which gives you new hw fences is what doesn't work.

Either you re-use the old hw fence (in which case you wouldn't have to 
switch it in the first place) or you allocate a new one (which violates 
the no-memory allocation requirements).

> To be clear: I'm not asking those questions because I think I know any
> better, but because I'm actually not sure and would like to understand
> your line of thought and background information when you say "this
> dance with detaching the scheduler fence from the hw fence and attach a
> new one is what absolutely doesn't work" above.

And that is really appreciated, we don't have enough people looking into 
this and especially pushing back on bad ideas.

> Driver writers need to understand the limitations of the current
> resubmission scheduler code to do better in their driver
> implementation, otherwise we just end up with (worse) open-coded
> variations of that code in each driver.

Yes, completely agree.

Regards,
Christian.

>
> Regards,
> Lucas


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-03  8:47       ` Christian König
  2023-05-03  9:49         ` Boris Brezillon
  2023-05-03 10:28         ` Lucas Stach
@ 2023-05-04  4:54         ` Matthew Brost
  2023-05-04 11:07           ` Christian König
  2 siblings, 1 reply; 14+ messages in thread
From: Matthew Brost @ 2023-05-04  4:54 UTC (permalink / raw)
  To: Christian König
  Cc: Tuikov, Luben, Sarah Walker, dri-devel@lists.freedesktop.org,
	Boris Brezillon, Alex Deucher

On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
> Adding Luben as well.
> 
> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > [SNIP]
> > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > After the discussion I had with Matthew yesterday on IRC, I
> > realized there was no clear agreement on this. Matthew uses those 3
> > helpers in the Xe driver right now, and given he intends to use a
> > multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> > can get away without calling drm_sched_{start,stop}().
> > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > or extend the ::run_job() one to support the resubmit semantics, which,
> > AFAIU, is just about enforcing the job done fence (the one returned by
> > ::run_job()) doesn't transition from a signaled to an unsignaled state.
> > 
> > But probably more important than providing a generic helper, we should
> > document the resubmit semantics (AKA, what should and/or shouldn't be
> > done with pending jobs when a recovery happens). Because forbidding
> > people to use a generic helper function doesn't give any guarantee that
> > they'll do the right thing when coding their own logic, unless we give
> > clues about what's considered right/wrong, and the current state of the
> > doc is pretty unclear in this regard.
> 
> I should probably talk about the history of the re-submit feature a bit
> more.
> 
> Basically AMD came up with re-submission as a cheap way of increasing the
> reliability of GPU resets. Problem is that it turned into an absolutely
> nightmare. We tried for the last 5 years or so to get that stable and it's
> still crashing.
> 
> The first and most major problem is that the kernel doesn't even has the
> information if re-submitting jobs is possible or not. For example a job
> which has already been pushed to the hw could have grabbed a binary
> semaphore and re-submitting it will just wait forever for the semaphore to
> be released.
> 
> The second problem is that the dma_fence semantics don't allow to ever
> transit the state of a fence from signaled back to unsignaled. This means
> that you can't re-use the hw fence and need to allocate a new one, but since
> memory allocation is forbidden inside a reset handler as well (YES we need
> to better document that part) you actually need to keep a bunch of hw fences
> pre-allocated around to make this work. Amdgpu choose to illegally re-use
> the hw fence instead which only works with quite extreme hacks.
> 
> The third problem is that the lifetime of the job object was actually
> defined very well before we tried to use re-submission. Basically it's just
> an intermediate state used between the IOCTL and pushing things to the hw,
> introducing this re-submit feature completely messed that up and cause quite
> a number of use after free errors in the past which are again only solved by
> quite some hacks.
> 
> What we should do in the GPU scheduler instead is the follow:
> 
> 1. Don't support re-submission at all!
>     Instead we can provide help to drivers to query which fences (scheduler
> or hw) are still not signaled yet.
>     This can then be used to re-create hw state if (and only if!) the driver
> knows what it's doing and can actually guarantee that this will work.
>     E.g. the case for XE where the kernel driver knows the contexts which
> were not running at the time and can re-create their queues.
> 
> 2. We can provide both a wq to use for single threaded application as well
> as start/stop semantics.
>     It's just that the start/stop semantics should never touch what was
> already submitted, but rather just make sure that we don't get any new
> submissions.
> 

I pretty much agree with everything Christian has said here and Xe
aligns with this. Let me explain what Xe does.

1. Entity hang (TDR timeout of job on a entity, firmware notifies Xe that a
entity hung, entity IOMMU CAT error, etc...):
	- No re-submission at all
	- ban the entity
	- notify the UMD
	- cleanup all pending jobs / fences
2. Entire GPU hang (worth mentioning with good HW + KMD this *should*
never happen):
	- stop all schedulers (same as a entity in Xe because 1 to 1)
	- cleanup odd entity state related to communication with the
	  firmware
	- check if an entity has a job that started but not finished, if
	  so ban it (same mechanism as above)
	- resubmit all jobs from good entities
	- start all schedulers (same as a entity in Xe because 1 to 1)

The implementation for this in the following file [1]. Search for the
drm scheduler functions and you should be able to find implementation
easily.

If you want to use an ordered work queue to avoid the stop / start dance
great do that, in Xe the stop / start dance works. I have extensively
tested this and the flow is rock solid and please trust me when I say
this as I worked on reset flows in the i915 and fought bugs for years, I
still don't think it is in the i915 because we try to do resubmission +
crazy races. Because of that I was incredibly paranoid when I
implemented this + tested it extensively.

I'll post an updated version of my DRM scheduler series [2] on the list
soon for the WQ changes, plus whatever else is required for Xe.

So my take from this discussion is maybe with a little documentation, we
are good. Thoughts?

Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
[2] https://patchwork.freedesktop.org/series/116055/

> Regards,
> Christian.
> 
> > 
> > Regards,
> > 
> > Boris
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-04  4:54         ` Matthew Brost
@ 2023-05-04 11:07           ` Christian König
  2023-05-04 13:07             ` Matthew Brost
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2023-05-04 11:07 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Tuikov, Luben, Sarah Walker, dri-devel@lists.freedesktop.org,
	Boris Brezillon, Alex Deucher

Am 04.05.23 um 06:54 schrieb Matthew Brost:
> On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
>> Adding Luben as well.
>>
>> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
>>> [SNIP]
>>>> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
>>> After the discussion I had with Matthew yesterday on IRC, I
>>> realized there was no clear agreement on this. Matthew uses those 3
>>> helpers in the Xe driver right now, and given he intends to use a
>>> multi-threaded wq for its 1:1 schedulers run queue, there's no way he
>>> can get away without calling drm_sched_{start,stop}().
>>> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
>>> wondering if it wouldn't be preferable to add a ::resubmit_job() method
>>> or extend the ::run_job() one to support the resubmit semantics, which,
>>> AFAIU, is just about enforcing the job done fence (the one returned by
>>> ::run_job()) doesn't transition from a signaled to an unsignaled state.
>>>
>>> But probably more important than providing a generic helper, we should
>>> document the resubmit semantics (AKA, what should and/or shouldn't be
>>> done with pending jobs when a recovery happens). Because forbidding
>>> people to use a generic helper function doesn't give any guarantee that
>>> they'll do the right thing when coding their own logic, unless we give
>>> clues about what's considered right/wrong, and the current state of the
>>> doc is pretty unclear in this regard.
>> I should probably talk about the history of the re-submit feature a bit
>> more.
>>
>> Basically AMD came up with re-submission as a cheap way of increasing the
>> reliability of GPU resets. Problem is that it turned into an absolutely
>> nightmare. We tried for the last 5 years or so to get that stable and it's
>> still crashing.
>>
>> The first and most major problem is that the kernel doesn't even has the
>> information if re-submitting jobs is possible or not. For example a job
>> which has already been pushed to the hw could have grabbed a binary
>> semaphore and re-submitting it will just wait forever for the semaphore to
>> be released.
>>
>> The second problem is that the dma_fence semantics don't allow to ever
>> transit the state of a fence from signaled back to unsignaled. This means
>> that you can't re-use the hw fence and need to allocate a new one, but since
>> memory allocation is forbidden inside a reset handler as well (YES we need
>> to better document that part) you actually need to keep a bunch of hw fences
>> pre-allocated around to make this work. Amdgpu choose to illegally re-use
>> the hw fence instead which only works with quite extreme hacks.
>>
>> The third problem is that the lifetime of the job object was actually
>> defined very well before we tried to use re-submission. Basically it's just
>> an intermediate state used between the IOCTL and pushing things to the hw,
>> introducing this re-submit feature completely messed that up and cause quite
>> a number of use after free errors in the past which are again only solved by
>> quite some hacks.
>>
>> What we should do in the GPU scheduler instead is the follow:
>>
>> 1. Don't support re-submission at all!
>>      Instead we can provide help to drivers to query which fences (scheduler
>> or hw) are still not signaled yet.
>>      This can then be used to re-create hw state if (and only if!) the driver
>> knows what it's doing and can actually guarantee that this will work.
>>      E.g. the case for XE where the kernel driver knows the contexts which
>> were not running at the time and can re-create their queues.
>>
>> 2. We can provide both a wq to use for single threaded application as well
>> as start/stop semantics.
>>      It's just that the start/stop semantics should never touch what was
>> already submitted, but rather just make sure that we don't get any new
>> submissions.
>>
> I pretty much agree with everything Christian has said here and Xe
> aligns with this. Let me explain what Xe does.
>
> 1. Entity hang (TDR timeout of job on a entity, firmware notifies Xe that a
> entity hung, entity IOMMU CAT error, etc...):
> 	- No re-submission at all
> 	- ban the entity
> 	- notify the UMD
> 	- cleanup all pending jobs / fences
> 2. Entire GPU hang (worth mentioning with good HW + KMD this *should*
> never happen):
> 	- stop all schedulers (same as a entity in Xe because 1 to 1)
> 	- cleanup odd entity state related to communication with the
> 	  firmware
> 	- check if an entity has a job that started but not finished, if
> 	  so ban it (same mechanism as above)
> 	- resubmit all jobs from good entities

As long as you can do this without creating new dma_fence objects for 
the hw submissions everything should be fine.

> 	- start all schedulers (same as a entity in Xe because 1 to 1)
>
> The implementation for this in the following file [1]. Search for the
> drm scheduler functions and you should be able to find implementation
> easily.
>
> If you want to use an ordered work queue to avoid the stop / start dance
> great do that, in Xe the stop / start dance works. I have extensively
> tested this and the flow is rock solid and please trust me when I say
> this as I worked on reset flows in the i915 and fought bugs for years, I
> still don't think it is in the i915 because we try to do resubmission +
> crazy races. Because of that I was incredibly paranoid when I
> implemented this + tested it extensively.
>
> I'll post an updated version of my DRM scheduler series [2] on the list
> soon for the WQ changes, plus whatever else is required for Xe.
>
> So my take from this discussion is maybe with a little documentation, we
> are good. Thoughts?

For XE what you describe above basically sounds perfectly fine to me.

I strongly assume when you re-submit things you just tell your hw 
scheduler to pick up at the place it left of? E.g. set a read pointer 
and write pointer of a ring buffer appropriately?

Christian.

>
> Matt
>
> [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
> [2] https://patchwork.freedesktop.org/series/116055/
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>
>>> Boris


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-04 11:07           ` Christian König
@ 2023-05-04 13:07             ` Matthew Brost
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Brost @ 2023-05-04 13:07 UTC (permalink / raw)
  To: Christian König
  Cc: Tuikov, Luben, Sarah Walker, dri-devel@lists.freedesktop.org,
	Boris Brezillon, Alex Deucher

On Thu, May 04, 2023 at 01:07:29PM +0200, Christian König wrote:
> Am 04.05.23 um 06:54 schrieb Matthew Brost:
> > On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
> > > Adding Luben as well.
> > > 
> > > Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > > > [SNIP]
> > > > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > > > After the discussion I had with Matthew yesterday on IRC, I
> > > > realized there was no clear agreement on this. Matthew uses those 3
> > > > helpers in the Xe driver right now, and given he intends to use a
> > > > multi-threaded wq for its 1:1 schedulers run queue, there's no way he
> > > > can get away without calling drm_sched_{start,stop}().
> > > > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > > > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > > > or extend the ::run_job() one to support the resubmit semantics, which,
> > > > AFAIU, is just about enforcing the job done fence (the one returned by
> > > > ::run_job()) doesn't transition from a signaled to an unsignaled state.
> > > > 
> > > > But probably more important than providing a generic helper, we should
> > > > document the resubmit semantics (AKA, what should and/or shouldn't be
> > > > done with pending jobs when a recovery happens). Because forbidding
> > > > people to use a generic helper function doesn't give any guarantee that
> > > > they'll do the right thing when coding their own logic, unless we give
> > > > clues about what's considered right/wrong, and the current state of the
> > > > doc is pretty unclear in this regard.
> > > I should probably talk about the history of the re-submit feature a bit
> > > more.
> > > 
> > > Basically AMD came up with re-submission as a cheap way of increasing the
> > > reliability of GPU resets. Problem is that it turned into an absolutely
> > > nightmare. We tried for the last 5 years or so to get that stable and it's
> > > still crashing.
> > > 
> > > The first and most major problem is that the kernel doesn't even has the
> > > information if re-submitting jobs is possible or not. For example a job
> > > which has already been pushed to the hw could have grabbed a binary
> > > semaphore and re-submitting it will just wait forever for the semaphore to
> > > be released.
> > > 
> > > The second problem is that the dma_fence semantics don't allow to ever
> > > transit the state of a fence from signaled back to unsignaled. This means
> > > that you can't re-use the hw fence and need to allocate a new one, but since
> > > memory allocation is forbidden inside a reset handler as well (YES we need
> > > to better document that part) you actually need to keep a bunch of hw fences
> > > pre-allocated around to make this work. Amdgpu choose to illegally re-use
> > > the hw fence instead which only works with quite extreme hacks.
> > > 
> > > The third problem is that the lifetime of the job object was actually
> > > defined very well before we tried to use re-submission. Basically it's just
> > > an intermediate state used between the IOCTL and pushing things to the hw,
> > > introducing this re-submit feature completely messed that up and cause quite
> > > a number of use after free errors in the past which are again only solved by
> > > quite some hacks.
> > > 
> > > What we should do in the GPU scheduler instead is the follow:
> > > 
> > > 1. Don't support re-submission at all!
> > >      Instead we can provide help to drivers to query which fences (scheduler
> > > or hw) are still not signaled yet.
> > >      This can then be used to re-create hw state if (and only if!) the driver
> > > knows what it's doing and can actually guarantee that this will work.
> > >      E.g. the case for XE where the kernel driver knows the contexts which
> > > were not running at the time and can re-create their queues.
> > > 
> > > 2. We can provide both a wq to use for single threaded application as well
> > > as start/stop semantics.
> > >      It's just that the start/stop semantics should never touch what was
> > > already submitted, but rather just make sure that we don't get any new
> > > submissions.
> > > 
> > I pretty much agree with everything Christian has said here and Xe
> > aligns with this. Let me explain what Xe does.
> > 
> > 1. Entity hang (TDR timeout of job on a entity, firmware notifies Xe that a
> > entity hung, entity IOMMU CAT error, etc...):
> > 	- No re-submission at all
> > 	- ban the entity
> > 	- notify the UMD
> > 	- cleanup all pending jobs / fences
> > 2. Entire GPU hang (worth mentioning with good HW + KMD this *should*
> > never happen):
> > 	- stop all schedulers (same as a entity in Xe because 1 to 1)
> > 	- cleanup odd entity state related to communication with the
> > 	  firmware
> > 	- check if an entity has a job that started but not finished, if
> > 	  so ban it (same mechanism as above)
> > 	- resubmit all jobs from good entities
> 
> As long as you can do this without creating new dma_fence objects for the hw
> submissions everything should be fine.
> 

Yep, same fence.

> > 	- start all schedulers (same as a entity in Xe because 1 to 1)
> > 
> > The implementation for this in the following file [1]. Search for the
> > drm scheduler functions and you should be able to find implementation
> > easily.
> > 
> > If you want to use an ordered work queue to avoid the stop / start dance
> > great do that, in Xe the stop / start dance works. I have extensively
> > tested this and the flow is rock solid and please trust me when I say
> > this as I worked on reset flows in the i915 and fought bugs for years, I
> > still don't think it is in the i915 because we try to do resubmission +
> > crazy races. Because of that I was incredibly paranoid when I
> > implemented this + tested it extensively.
> > 
> > I'll post an updated version of my DRM scheduler series [2] on the list
> > soon for the WQ changes, plus whatever else is required for Xe.
> > 
> > So my take from this discussion is maybe with a little documentation, we
> > are good. Thoughts?
> 
> For XE what you describe above basically sounds perfectly fine to me.
> 
> I strongly assume when you re-submit things you just tell your hw scheduler
> to pick up at the place it left of? E.g. set a read pointer and write
> pointer of a ring buffer appropriately?
>

Yep.

Matt
 
> Christian.
> 
> > 
> > Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
> > [2] https://patchwork.freedesktop.org/series/116055/
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Regards,
> > > > 
> > > > Boris
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
  2023-05-02 11:36 ` Christian König
  2023-05-02 12:41   ` Boris Brezillon
@ 2023-05-02 16:50   ` Boris Brezillon
  1 sibling, 0 replies; 14+ messages in thread
From: Boris Brezillon @ 2023-05-02 16:50 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, Sarah Walker, dri-devel@lists.freedesktop.org,
	Alex Deucher

+Matthew who's been working on the kthread -> wq transition.

On Tue, 2 May 2023 13:36:07 +0200
Christian König <christian.koenig@amd.com> wrote:

> Hi Boris,
> 
> Am 02.05.23 um 13:19 schrieb Boris Brezillon:
> > Hello Christian, Alex,
> >
> > As part of our transition to drm_sched for the powervr GPU driver, we
> > realized drm_sched_resubmit_jobs(), which is used by all drivers
> > relying on drm_sched right except amdgpu, has been deprecated.
> > Unfortunately, commit 5efbe6aa7a0e ("drm/scheduler: deprecate
> > drm_sched_resubmit_jobs") doesn't describe what drivers should do or use
> > as an alternative.
> >
> > At the very least, for our implementation, we need to restore the
> > drm_sched_job::parent pointers that were set to NULL in
> > drm_sched_stop(), such that jobs submitted before the GPU recovery are
> > considered active when drm_sched_start() is called. That could be done
> > with a custom pending_list iteration restoring drm_sched_job::parent's
> > pointer, but that seems odd to let the scheduler backend manipulate
> > this list directly, and I suspect we need to do other checks, like the
> > karma vs hang-limit thing, so we can flag the entity dirty and cancel
> > all jobs being queued there if the entity has caused too many hangs.
> >
> > Now that drm_sched_resubmit_jobs() has been deprecated, that would be
> > great if you could help us write a piece of documentation describing
> > what should be done between drm_sched_stop() and drm_sched_start(), so
> > new drivers don't come up with their own slightly different/broken
> > version of the same thing.  
> 
> Yeah, really good point! The solution is to not use drm_sched_stop() and 
> drm_sched_start() either.
> 
> The general idea Daniel, the other Intel guys and me seem to have agreed 
> on is to convert the scheduler thread into a work item.
> 
> This work item for pushing jobs to the hw can then be queued to the same 
> workqueue we use for the timeout work item.
> 
> If this workqueue is now configured by your driver as single threaded 
> you have a guarantee that only either the scheduler or the timeout work 
> item is running at the same time. That in turn makes starting/stopping 
> the scheduler for a reset completely superfluous.
> 
> Patches for this has already been floating on the mailing list, but 
> haven't been committed yet. Since this is all WIP.

As I said previously, our work is based on Matthew's patch series
converting drm_sched threads to a workqueue based implementation.
And I think there are a couple of shortcomings with the current
implementation if we do what you suggest (one single-threaded workqueue
used to serialize the GPU resets and job dequeuing/submission):

- drm_sched_main() has a loop dequeuing jobs until it can't (because a
  dep is not signaled, or because all slots are taken), not one at a
  time. There can also be several 1:1 entities waiting for jobs to be
  dequeued before the reset work. That means the reset operation might
  be delayed if we don't have a drm_sched_stop() calling cancel_work()
  and making sure we break out of the loop.
- with a workqueue (and that's even worse with a single-threaded one),
  the entity priorities are not enforced at the de-queuing level. The FW
  will still take care of execution priority, but you can have a low
  prio drm_sched_entity getting pushed a lot of jobs, and with the
  drm_sched_main() loop, you'll only let other entities dequeue their
  jobs when the ringbuf of this low prio entity is full or a job dep
  is not signaled. De-queuing one job at a time makes things better, but
  if you have a lot of 1:1 entities, it can still take sometime until
  you reach the high prio drm_sched worker.

I'm just wondering if you already have solutions to these problems.

Regards,

Boris

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-05-04 13:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-02 11:19 drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated Boris Brezillon
2023-05-02 11:36 ` Christian König
2023-05-02 12:41   ` Boris Brezillon
2023-05-03  8:16     ` Boris Brezillon
2023-05-03  8:47       ` Christian König
2023-05-03  9:49         ` Boris Brezillon
2023-05-03 10:28         ` Lucas Stach
2023-05-03 11:40           ` Christian König
2023-05-03 13:10             ` Lucas Stach
2023-05-03 15:01               ` Christian König
2023-05-04  4:54         ` Matthew Brost
2023-05-04 11:07           ` Christian König
2023-05-04 13:07             ` Matthew Brost
2023-05-02 16:50   ` Boris Brezillon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.