* [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
@ 2025-07-14 6:23 Lin.Cao
2025-07-14 11:03 ` Christian König
2025-07-14 12:46 ` Philipp Stanner
From: Lin.Cao @ 2025-07-14 6:23 UTC (permalink / raw)
To: dri-devel
Cc: zhenguo.yin, Emily.Deng, Christian König, phasta, dakr,
matthew.brost, Lin.Cao
When Application A submits jobs (a1, a2, a3) and application B submits
job b1 with a dependency on a2's scheduler fence, killing application A
before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
signal all jobs sequentially. However, due to missing work_run_job or
work_free_job in entity_kill_job_work(), the scheduler enters sleep
state, causing application B hang.
Add drm_sched_wakeup() when entity_kill_job_work() to preventing
scheduler sleep and subsequent application hangs.
v2:
- Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
Signed-off-by: Lin.Cao <lincao12@amd.com>
---
drivers/gpu/drm/scheduler/sched_entity.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index e671aa241720..66f2a43c58fd 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
drm_sched_fence_scheduled(job->s_fence, NULL);
+ drm_sched_wakeup(job->sched);
drm_sched_fence_finished(job->s_fence, -ESRCH);
WARN_ON(job->s_fence->parent);
job->sched->ops->free_job(job);
--
2.46.1
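For context: drm_sched_wakeup() essentially just re-queues the scheduler's run-job work item, which is exactly the step the kill path above is missing. A rough sketch, simplified and not verbatim kernel source:

void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
{
	/*
	 * Queue work_run_job so the scheduler re-selects entities and
	 * pops the next runnable job -- e.g. b1 once a2's scheduled
	 * fence has been force-signaled by the kill path.
	 */
	drm_sched_run_job_queue(sched);
}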
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-14 6:23 [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs Lin.Cao
@ 2025-07-14 11:03 ` Christian König
2025-07-14 12:46 ` Philipp Stanner
From: Christian König @ 2025-07-14 11:03 UTC (permalink / raw)
To: Lin.Cao, dri-devel; +Cc: zhenguo.yin, Emily.Deng, phasta, dakr, matthew.brost
On 14.07.25 08:23, Lin.Cao wrote:
> When Application A submits jobs (a1, a2, a3) and application B submits
> job b1 with a dependency on a2's scheduler fence, killing application A
> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> signal all jobs sequentially. However, due to missing work_run_job or
> work_free_job in entity_kill_job_work(), the scheduler enters sleep
> state, causing application B hang.
>
> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> scheduler sleep and subsequent application hangs.
>
> v2:
> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>
> Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index e671aa241720..66f2a43c58fd 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>
> drm_sched_fence_scheduled(job->s_fence, NULL);
> + drm_sched_wakeup(job->sched);
> drm_sched_fence_finished(job->s_fence, -ESRCH);
> WARN_ON(job->s_fence->parent);
> job->sched->ops->free_job(job);
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-14 6:23 [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs Lin.Cao
2025-07-14 11:03 ` Christian König
@ 2025-07-14 12:46 ` Philipp Stanner
2025-07-14 13:08 ` Christian König
From: Philipp Stanner @ 2025-07-14 12:46 UTC (permalink / raw)
To: Lin.Cao, dri-devel
Cc: zhenguo.yin, Emily.Deng, Christian König, phasta, dakr,
matthew.brost
regarding the patch subject: the prefix we use for the scheduler is:
drm/sched:
On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> When Application A submits jobs (a1, a2, a3) and application B submits
s/Application/application
> job b1 with a dependency on a2's scheduler fence, killing application A
> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> signal all jobs sequentially. However, due to missing work_run_job or
> work_free_job
>
You probably mean "due to the work items work_run_job and work_free_job
not being started […]".
However, the call you add, drm_sched_wakeup(), immediately only
triggers work_run_job. It's not clear to me why you mention
work_free_job, because drm_sched_entity_kill_jobs_work() directly
invokes the free_job() callback.
> in entity_kill_job_work(), the scheduler enters sleep
> state, causing application B hang.
So the issue is that if a1 never ran, the scheduler will not continue
with the jobs of application B, correct?
>
> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
s/to preventing/to prevent
> scheduler sleep and subsequent application hangs.
>
> v2:
> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
Changelogs are cool and useful, but move them below the Signed-off area
with another --- please, like so:
Signed-off by: …
---
v2:
…
---
drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>
> Signed-off-by: Lin.Cao <lincao12@amd.com>
This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag and
put the stable kernel in CC.
Please figure out with git blame, git log and git tag --contains which
commit your patch fixes and which is the oldest kernel that contains
the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll find
lots of examples for that in git log.
Thx
P.
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index e671aa241720..66f2a43c58fd 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>
> drm_sched_fence_scheduled(job->s_fence, NULL);
> + drm_sched_wakeup(job->sched);
> drm_sched_fence_finished(job->s_fence, -ESRCH);
> WARN_ON(job->s_fence->parent);
> job->sched->ops->free_job(job);
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-14 12:46 ` Philipp Stanner
@ 2025-07-14 13:08 ` Christian König
2025-07-14 13:27 ` Philipp Stanner
From: Christian König @ 2025-07-14 13:08 UTC (permalink / raw)
To: phasta, Lin.Cao, dri-devel; +Cc: zhenguo.yin, Emily.Deng, dakr, matthew.brost
On 14.07.25 14:46, Philipp Stanner wrote:
> regarding the patch subject: the prefix we use for the scheduler is:
> drm/sched:
>
>
> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>
>> When Application A submits jobs (a1, a2, a3) and application B submits
>
> s/Application/application
>
>> job b1 with a dependency on a2's scheduler fence, killing application A
>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
>> signal all jobs sequentially. However, due to missing work_run_job or
>> work_free_job
>>
>
> You probably mean "due to the work items work_run_job and work_free_job
> not being started […]".
>
> However, the call you add, drm_sched_wakeup(), immediately only
> triggers work_run_job. It's not clear to me why you mention
> work_free_job, because drm_sched_entity_kill_jobs_work() directly
> invokes the free_job() callback.
Yeah the description is rather confusing, took me more than one try to understand what's going on here as well. Let me try to explain it in my words:
When an application A is killed the pending submissions from it are not executed, but rather torn down by drm_sched_entity_kill_jobs_work().
If now a submission from application B depends on something application A wanted to do on the same scheduler instance the function drm_sched_entity_add_dependency_cb() would have optimized this dependency and assumed that the scheduler work would already run and try to pick a job.
But that isn't the case when the submissions from application A are all killed. So make sure to start the scheduler work from drm_sched_entity_kill_jobs_work() to handle that case.
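The path in question, heavily condensed (sched_entity.c, simplified; the exact code varies between kernel versions):

static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
{
	struct drm_sched_fence *s_fence = to_drm_sched_fence(entity->dependency);

	if (s_fence && s_fence->sched == entity->rq->sched) {
		/*
		 * Same scheduler: only wait for the dependency to be
		 * *scheduled*, assuming the scheduler is running anyway.
		 * The callback merely clears the dependency -- it does
		 * NOT wake the scheduler up.
		 */
		return !dma_fence_add_callback(&s_fence->scheduled,
					       &entity->cb,
					       drm_sched_entity_clear_dep);
	}

	/* Foreign fence: this callback does call drm_sched_wakeup(). */
	return !dma_fence_add_callback(entity->dependency, &entity->cb,
				       drm_sched_entity_wakeup);
}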
Regards,
Christian.
>
>> in entity_kill_job_work(), the scheduler enters sleep
>> state, causing application B hang.
>
> So the issue is that if a1 never ran, the scheduler will not continue
> with the jobs of application B, correct?
>
>>
>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>
> s/to preventing/to prevent
>
>> scheduler sleep and subsequent application hangs.
>>
>> v2:
>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>
> Changelogs are cool and useful, but move them below the Signed-off area
> with another --- please, like so:
>
> Signed-off by: …
> ---
> v2:
> …
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>
>
>>
>> Signed-off-by: Lin.Cao <lincao12@amd.com>
>
> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag and
> put the stable kernel in CC.
>
> Please figure out with git blame, git log and git tag --contains which
> commit your patch fixes and which is the oldest kernel that contains
> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll find
> lots of examples for that in git log.
>
>
> Thx
> P.
>
>> ---
>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index e671aa241720..66f2a43c58fd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>> struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>
>> drm_sched_fence_scheduled(job->s_fence, NULL);
>> + drm_sched_wakeup(job->sched);
>> drm_sched_fence_finished(job->s_fence, -ESRCH);
>> WARN_ON(job->s_fence->parent);
>> job->sched->ops->free_job(job);
>
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-14 13:08 ` Christian König
@ 2025-07-14 13:27 ` Philipp Stanner
2025-07-14 13:39 ` Christian König
From: Philipp Stanner @ 2025-07-14 13:27 UTC (permalink / raw)
To: Christian König, phasta, Lin.Cao, dri-devel
Cc: zhenguo.yin, Emily.Deng, dakr, matthew.brost
On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>
>
> On 14.07.25 14:46, Philipp Stanner wrote:
> > regarding the patch subject: the prefix we use for the scheduler
> > is:
> > drm/sched:
> >
> >
> > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> >
> > > When Application A submits jobs (a1, a2, a3) and application B
> > > submits
> >
> > s/Application/application
> >
> > > job b1 with a dependency on a2's scheduler fence, killing
> > > application A
> > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > force
> > > signal all jobs sequentially. However, due to missing
> > > work_run_job or
> > > work_free_job
> > >
> >
> > You probably mean "due to the work items work_run_job and
> > work_free_job
> > not being started […]".
> >
> > However, the call you add, drm_sched_wakeup(), immediately only
> > triggers work_run_job. It's not clear to me why you mention
> > work_free_job, because drm_sched_entity_kill_jobs_work() directly
> > invokes the free_job() callback.
>
> Yeah the description is rather confusing, took me more than one try
> to understand what's going on here as well. Let me try to explain it
> in my words:
>
> When an application A is killed the pending submissions from it are
> not executed, but rather torn down by
> drm_sched_entity_kill_jobs_work().
>
> If now a submission from application B depends on something
> application A wanted to do on the same scheduler instance the
> function drm_sched_entity_add_dependency_cb() would have optimized
> this dependency and assumed that the scheduler work would already run
> and try to pick a job.
>
> But that isn't the case when the submissions from application A are
> all killed. So make sure to start the scheduler work from
> drm_sched_entity_kill_jobs_work() to handle that case.
Alright, so the bug then depends on B setting a dependency on A _after_
A was killed, doesn't it? Because the optimization in
_add_dependency_cb() can only have an effect after A's jobs have been
torn down.
If there is such a timing order problem, the commit message should
mention it.
The commit message definitely also needs to mention this optimization
in drm_sched_entity_add_dependency_cb() since it's essential for
understanding the bug.
Danke
P.
>
> Regards,
> Christian.
>
> >
> > > in entity_kill_job_work(), the scheduler enters sleep
> > > state, causing application B hang.
> >
> > So the issue is that if a1 never ran, the scheduler will not
> > continue
> > with the jobs of application B, correct?
> >
> > >
> > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> >
> > s/to preventing/to prevent
> >
> > > scheduler sleep and subsequent application hangs.
> > >
> > > v2:
> > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> >
> > Changelogs are cool and useful, but move them below the Signed-off
> > area
> > with another --- please, like so:
> >
> > Signed-off by: …
> > ---
> > v2:
> > …
> > ---
> > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> >
> >
> > >
> > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> >
> > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
> > and
> > put the stable kernel in CC.
> >
> > Please figure out with git blame, git log and git tag --contains
> > which
> > commit your patch fixes and which is the oldest kernel that
> > contains
> > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
> > find
> > lots of examples for that in git log.
> >
> >
> > Thx
> > P.
> >
> > > ---
> > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index e671aa241720..66f2a43c58fd 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -177,6 +177,7 @@ static void
> > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > struct drm_sched_job *job = container_of(wrk,
> > > typeof(*job), work);
> > >
> > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > + drm_sched_wakeup(job->sched);
> > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > WARN_ON(job->s_fence->parent);
> > > job->sched->ops->free_job(job);
> >
>
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-14 13:27 ` Philipp Stanner
@ 2025-07-14 13:39 ` Christian König
2025-07-15 3:38 ` cao, lin
From: Christian König @ 2025-07-14 13:39 UTC (permalink / raw)
To: phasta, Lin.Cao, dri-devel; +Cc: zhenguo.yin, Emily.Deng, dakr, matthew.brost
On 14.07.25 15:27, Philipp Stanner wrote:
> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>
>>
>> On 14.07.25 14:46, Philipp Stanner wrote:
>>> regarding the patch subject: the prefix we use for the scheduler
>>> is:
>>> drm/sched:
>>>
>>>
>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>
>>>> When Application A submits jobs (a1, a2, a3) and application B
>>>> submits
>>>
>>> s/Application/application
>>>
>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>> application A
>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
>>>> force
>>>> signal all jobs sequentially. However, due to missing
>>>> work_run_job or
>>>> work_free_job
>>>>
>>>
>>> You probably mean "due to the work items work_run_job and
>>> work_free_job
>>> not being started […]".
>>>
>>> However, the call you add, drm_sched_wakeup(), immediately only
>>> triggers work_run_job. It's not clear to me why you mention
>>> work_free_job, because drm_sched_entity_kill_jobs_work() directly
>>> invokes the free_job() callback.
>>
>> Yeah the description is rather confusing, took me more than one try
>> to understand what's going on here as well. Let me try to explain it
>> in my words:
>>
>> When an application A is killed the pending submissions from it are
>> not executed, but rather torn down by
>> drm_sched_entity_kill_jobs_work().
>>
>> If now a submission from application B depends on something
>> application A wanted to do on the same scheduler instance the
>> function drm_sched_entity_add_dependency_cb() would have optimized
>> this dependency and assumed that the scheduler work would already run
>> and try to pick a job.
>>
>> But that isn't the case when the submissions from application A are
>> all killed. So make sure to start the scheduler work from
>> drm_sched_entity_kill_jobs_work() to handle that case.
>
> Alright, so the bug then depends on B setting a dependency on A _after_
> A was killed, doesn't it? Because the optimization in
> _add_dependency_cb() can only have an effect after A's jobs have been
> torn down.
No, application A and application B submit right behind each other. Application A is then killed later on.
The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
> If there is such a timing order problem, the commit message should
> mention it.
>
> The commit message definitely also needs to mention this optimization
> in drm_sched_entity_add_dependency_cb() since it's essential for
> understanding the bug.
+1
Christian.
>
>
> Danke
> P.
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>> state, causing application B hang.
>>>
>>> So the issue is that if a1 never ran, the scheduler will not
>>> continue
>>> with the jobs of application B, correct?
>>>
>>>>
>>>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>>>
>>> s/to preventing/to prevent
>>>
>>>> scheduler sleep and subsequent application hangs.
>>>>
>>>> v2:
>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>
>>> Changelogs are cool and useful, but move them below the Signed-off
>>> area
>>> with another --- please, like so:
>>>
>>> Signed-off by: …
>>> ---
>>> v2:
>>> …
>>> ---
>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>
>>>
>>>>
>>>> Signed-off-by: Lin.Cao <lincao12@amd.com>
>>>
>>> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
>>> and
>>> put the stable kernel in CC.
>>>
>>> Please figure out with git blame, git log and git tag --contains
>>> which
>>> commit your patch fixes and which is the oldest kernel that
>>> contains
>>> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
>>> find
>>> lots of examples for that in git log.
>>>
>>>
>>> Thx
>>> P.
>>>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index e671aa241720..66f2a43c58fd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -177,6 +177,7 @@ static void
>>>> drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>> struct drm_sched_job *job = container_of(wrk,
>>>> typeof(*job), work);
>>>>
>>>> drm_sched_fence_scheduled(job->s_fence, NULL);
>>>> + drm_sched_wakeup(job->sched);
>>>> drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>> WARN_ON(job->s_fence->parent);
>>>> job->sched->ops->free_job(job);
>>>
>>
>
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-14 13:39 ` Christian König
@ 2025-07-15 3:38 ` cao, lin
2025-07-15 9:12 ` Philipp Stanner
From: cao, lin @ 2025-07-15 3:38 UTC (permalink / raw)
To: Koenig, Christian, phasta@kernel.org,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
Hi Christian, Philipp,
I modified the commit msg, and I hope it makes more sense:
drm/sched: wake up scheduler when killing jobs to prevent hang
When application A submits jobs (a1, a2, a3) and application B submits
job b1 with a dependency on a2's scheduler fence, killing application A
before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
signal all jobs sequentially. However, the optimization in
drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
by adding a callback (drm_sched_entity_clear_dep) instead of immediately
waking up the scheduler. When A is killed before its jobs can run, the
callback gets triggered but drm_sched_entity_clear_dep() only clears the
dependency without waking up the scheduler, causing the scheduler to enter
sleep state and application B to hang.
Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
sleep and subsequent application hangs.
v2:
- Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
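For reference, the callback named above really does nothing but clear the dependency; a simplified sketch:

static void drm_sched_entity_clear_dep(struct dma_fence *f,
				       struct dma_fence_cb *cb)
{
	struct drm_sched_entity *entity =
		container_of(cb, struct drm_sched_entity, cb);

	/* Only clears the dependency; nothing queues work_run_job here. */
	entity->dependency = NULL;
	dma_fence_put(f);
}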
Thanks,
Lin
________________________________
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Monday, July 14, 2025 21:39
To: phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
On 14.07.25 15:27, Philipp Stanner wrote:
> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>
>>
>> On 14.07.25 14:46, Philipp Stanner wrote:
>>> regarding the patch subject: the prefix we use for the scheduler
>>> is:
>>> drm/sched:
>>>
>>>
>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>
>>>> When Application A submits jobs (a1, a2, a3) and application B
>>>> submits
>>>
>>> s/Application/application
>>>
>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>> application A
>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
>>>> force
>>>> signal all jobs sequentially. However, due to missing
>>>> work_run_job or
>>>> work_free_job
>>>>
>>>
>>> You probably mean "due to the work items work_run_job and
>>> work_free_job
>>> not being started […]".
>>>
>>> However, the call you add, drm_sched_wakeup(), immediately only
>>> triggers work_run_job. It's not clear to me why you mention
>>> work_free_job, because drm_sched_entity_kill_jobs_work() directly
>>> invokes the free_job() callback.
>>
>> Yeah the description is rather confusing, took me more than one try
>> to understand what's going on here as well. Let me try to explain it
>> in my words:
>>
>> When an application A is killed the pending submissions from it are
>> not executed, but rather torn down by
>> drm_sched_entity_kill_jobs_work().
>>
>> If now a submission from application B depends on something
>> application A wanted to do on the same scheduler instance the
>> function drm_sched_entity_add_dependency_cb() would have optimized
>> this dependency and assumed that the scheduler work would already run
>> and try to pick a job.
>>
>> But that isn't the case when the submissions from application A are
>> all killed. So make sure to start the scheduler work from
>> drm_sched_entity_kill_jobs_work() to handle that case.
>
> Alright, so the bug then depends on B setting a dependency on A _after_
> A was killed, doesn't it? Because the optimization in
> _add_dependency_cb() can only have an effect after A's jobs have been
> torn down.
No, application A and application B submit right behind each other. Application A is then killed later on.
The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
> If there is such a timing order problem, the commit message should
> mention it.
>
> The commit message definitely also needs to mention this optimization
> in drm_sched_entity_add_dependency_cb() since it's essential for
> understanding the bug.
+1
Christian.
>
>
> Danke
> P.
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>> state, causing application B hang.
>>>
>>> So the issue is that if a1 never ran, the scheduler will not
>>> continue
>>> with the jobs of application B, correct?
>>>
>>>>
>>>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>>>
>>> s/to preventing/to prevent
>>>
>>>> scheduler sleep and subsequent application hangs.
>>>>
>>>> v2:
>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>
>>> Changelogs are cool and useful, but move them below the Signed-off
>>> area
>>> with another --- please, like so:
>>>
>>> Signed-off by: …
>>> ---
>>> v2:
>>> …
>>> ---
>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>
>>>
>>>>
>>>> Signed-off-by: Lin.Cao <lincao12@amd.com>
>>>
>>> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
>>> and
>>> put the stable kernel in CC.
>>>
>>> Please figure out with git blame, git log and git tag --contains
>>> which
>>> commit your patch fixes and which is the oldest kernel that
>>> contains
>>> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
>>> find
>>> lots of examples for that in git log.
>>>
>>>
>>> Thx
>>> P.
>>>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index e671aa241720..66f2a43c58fd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -177,6 +177,7 @@ static void
>>>> drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>> struct drm_sched_job *job = container_of(wrk,
>>>> typeof(*job), work);
>>>>
>>>> drm_sched_fence_scheduled(job->s_fence, NULL);
>>>> + drm_sched_wakeup(job->sched);
>>>> drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>> WARN_ON(job->s_fence->parent);
>>>> job->sched->ops->free_job(job);
>>>
>>
>
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 3:38 ` cao, lin
@ 2025-07-15 9:12 ` Philipp Stanner
2025-07-15 9:51 ` cao, lin
From: Philipp Stanner @ 2025-07-15 9:12 UTC (permalink / raw)
To: cao, lin, Koenig, Christian, phasta@kernel.org,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>
>
>
>
> Hi Christian, Philipp,
>
>
> I modified the commit msg, and I hope it makes more sense:
>
>
> drm/sched: wake up scheduler when killing jobs to prevent hang
nit:
s/wake/Wake
>
>
> When application A submits jobs (a1, a2, a3) and application B submits
> job b1 with a dependency on a2's scheduler fence, killing application A
> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> signal all jobs sequentially. However, the optimization in
> drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
> by adding a callback (drm_sched_entity_clear_dep) instead of immediately
> waking up the scheduler. When A is killed before its jobs can run, the
> callback gets triggered but drm_sched_entity_clear_dep() only clears the
> dependency without waking up the scheduler, causing the scheduler to enter
> sleep state and application B to hang.
That now reads as if drm_sched_entity_clear_dep() is supposed to wake
up the scheduler. But you add the wakeup to a different function (the
work item).
Also the code actually looks like that:
		xa_erase(&job->dependencies, index);
		if (f && !dma_fence_add_callback(f, &job->finish_cb,
						 drm_sched_entity_kill_jobs_cb))
			return;

		dma_fence_put(f);
	}

	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
	schedule_work(&job->work);
}
So if adding that callback succeeds we.. return immediately but we.. do
that for the first dependency, not for all of them?
Oh boy. That code base. We(tm) should look into pimping that up. Also
lacks documentation everywhere.
Anyways. If this solves a bug for you the patch looks fine (i.e., not
potentially harmful) by me and if the tests succeed we can merge it.
However, I'd feel better if you could clarify more why that function is
the right place to solve the bug.
Thanks,
P.
>
>
> Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
> sleep and subsequent application hangs.
>
>
> v2:
> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>
>
> Thanks,
> Lin
>
>
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Monday, July 14, 2025 21:39
> To: phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>
>
>
>
> On 14.07.25 15:27, Philipp Stanner wrote:
> > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > >
> > >
> > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > regarding the patch subject: the prefix we use for the scheduler
> > > > is:
> > > > drm/sched:
> > > >
> > > >
> > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > >
> > > > > When Application A submits jobs (a1, a2, a3) and application B
> > > > > submits
> > > >
> > > > s/Application/application
> > > >
> > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > application A
> > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > > > force
> > > > > signal all jobs sequentially. However, due to missing
> > > > > work_run_job or
> > > > > work_free_job
> > > > >
> > > >
> > > > You probably mean "due to the work items work_run_job and
> > > > work_free_job
> > > > not being started […]".
> > > >
> > > > However, the call you add, drm_sched_wakeup(), immediately only
> > > > triggers work_run_job. It's not clear to me why you mention
> > > > work_free_job, because drm_sched_entity_kill_jobs_work() directly
> > > > invokes the free_job() callback.
> > >
> > > Yeah the description is rather confusing, took me more than one try
> > > to understand what's going on here as well. Let me try to explain it
> > > in my words:
> > >
> > > When an application A is killed the pending submissions from it are
> > > not executed, but rather torn down by
> > > drm_sched_entity_kill_jobs_work().
> > >
> > > If now a submission from application B depends on something
> > > application A wanted to do on the same scheduler instance the
> > > function drm_sched_entity_add_dependency_cb() would have optimized
> > > this dependency and assumed that the scheduler work would already run
> > > and try to pick a job.
> > >
> > > But that isn't the case when the submissions from application A are
> > > all killed. So make sure to start the scheduler work from
> > > drm_sched_entity_kill_jobs_work() to handle that case.
> >
> > Alright, so the bug then depends on B setting a dependency on A _after_
> > A was killed, doesn't it? Because the optimization in
> > _add_dependency_cb() can only have an effect after A's jobs have been
> > torn down.
>
> No, application A and application B submit right behind each other. Application A is then killed later on.
>
> The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
>
> > If there is such a timing order problem, the commit message should
> > mention it.
> >
> > The commit message definitely also needs to mention this optimization
> > in drm_sched_entity_add_dependency_cb() since it's essential for
> > understanding the bug.
>
> +1
>
> Christian.
>
> >
> >
> > Danke
> > P.
> >
> >
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > > in entity_kill_job_work(), the scheduler enters sleep
> > > > > state, causing application B hang.
> > > >
> > > > So the issue is that if a1 never ran, the scheduler will not
> > > > continue
> > > > with the jobs of application B, correct?
> > > >
> > > > >
> > > > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> > > >
> > > > s/to preventing/to prevent
> > > >
> > > > > scheduler sleep and subsequent application hangs.
> > > > >
> > > > > v2:
> > > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > > >
> > > > Changelogs are cool and useful, but move them below the Signed-off
> > > > area
> > > > with another --- please, like so:
> > > >
> > > > Signed-off by: …
> > > > ---
> > > > v2:
> > > > …
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > >
> > > >
> > > > >
> > > > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> > > >
> > > > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
> > > > and
> > > > put the stable kernel in CC.
> > > >
> > > > Please figure out with git blame, git log and git tag --contains
> > > > which
> > > > commit your patch fixes and which is the oldest kernel that
> > > > contains
> > > > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
> > > > find
> > > > lots of examples for that in git log.
> > > >
> > > >
> > > > Thx
> > > > P.
> > > >
> > > > > ---
> > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > 1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > @@ -177,6 +177,7 @@ static void
> > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > > > struct drm_sched_job *job = container_of(wrk,
> > > > > typeof(*job), work);
> > > > >
> > > > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > + drm_sched_wakeup(job->sched);
> > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > WARN_ON(job->s_fence->parent);
> > > > > job->sched->ops->free_job(job);
> > > >
> > >
> >
>
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 9:12 ` Philipp Stanner
@ 2025-07-15 9:51 ` cao, lin
2025-07-15 10:27 ` Philipp Stanner
From: cao, lin @ 2025-07-15 9:51 UTC (permalink / raw)
To: Koenig, Christian, phasta@kernel.org,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
Hi Philipp,
Thanks for your review, let me try to clarify why I added drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
When application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a1's scheduled fence, the normal execution flow is (function drm_sched_run_job_work()):
1. a1 gets popped from the entity by the scheduler
2. run_job(a1) executes
3. a1's scheduled fence gets signaled
4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
5. The scheduler wakes up and re-selects entities to pop jobs
6. Since b1's dependency is cleared, the scheduler can select b1 and continue the same flow
However, if application A is killed before a1 gets popped by the scheduler, then a1, a2, a3 are killed sequentially by drm_sched_entity_kill_jobs_cb(). During the kill process, their scheduled fences are still signaled, but the kill process itself lacks work_run_job. This means b1's dependency gets cleared, but there's no work_run_job to drive the scheduler to continue running, or we can say step 4 in the normal flow is missing, and the scheduler enters sleep state, which causes application B to hang.
So if we add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() after drm_sched_fence_scheduled(), the scheduler can be woken up, and application B can continue running after a1's scheduled fence is force signaled.
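To make that concrete, this is the work item with the patch applied (the diff above shown as a whole, comments added):

static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
{
	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);

	/* Force-signal the scheduled fence so dependents (e.g. b1) unblock. */
	drm_sched_fence_scheduled(job->s_fence, NULL);
	/* The fix: kick the scheduler, since no run_job work item will. */
	drm_sched_wakeup(job->sched);
	drm_sched_fence_finished(job->s_fence, -ESRCH);
	WARN_ON(job->s_fence->parent);
	job->sched->ops->free_job(job);
}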
Thanks,
Lin
________________________________
From: Philipp Stanner <phasta@mailbox.org>
Sent: Tuesday, July 15, 2025 17:12
To: cao, lin <lin.cao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; phasta@kernel.org <phasta@kernel.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>
>
>
>
> Hi Christian, Philipp,
>
>
> I modified the commit msg, and I hope it makes more sense:
>
>
> drm/sched: wake up scheduler when killing jobs to prevent hang
nit:
s/wake/Wake
>
>
> When application A submits jobs (a1, a2, a3) and application B submits
> job b1 with a dependency on a2's scheduler fence, killing application A
> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> signal all jobs sequentially. However, the optimization in
> drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
> by adding a callback (drm_sched_entity_clear_dep) instead of immediately
> waking up the scheduler. When A is killed before its jobs can run, the
> callback gets triggered but drm_sched_entity_clear_dep() only clears the
> dependency without waking up the scheduler, causing the scheduler to enter
> sleep state and application B to hang.
That now reads as if drm_sched_entity_clear_dep() is supposed to wake
up the scheduler. But you add the wakeup to a different function (the
work item).
Also the code actually looks like that:
		xa_erase(&job->dependencies, index);
		if (f && !dma_fence_add_callback(f, &job->finish_cb,
						 drm_sched_entity_kill_jobs_cb))
			return;

		dma_fence_put(f);
	}

	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
	schedule_work(&job->work);
}
So if adding that callback succeeds we.. return immediately but we.. do
that for the first dependency, not for all of them?
Oh boy. That code base. We(tm) should look into pimping that up. Also
lacks documentation everywhere.
Anyways. If this solves a bug for you the patch looks fine (i.e., not
potentially harmful) by me and if the tests succeed we can merge it.
However, I'd feel better if you could clarify more why that function is
the right place to solve the bug.
Thanks,
P.
>
>
> Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
> sleep and subsequent application hangs.
>
>
> v2:
> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>
>
> Thanks,
> Lin
>
>
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Monday, July 14, 2025 21:39
> To: phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>
>
>
>
> On 14.07.25 15:27, Philipp Stanner wrote:
> > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > >
> > >
> > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > regarding the patch subject: the prefix we use for the scheduler
> > > > is:
> > > > drm/sched:
> > > >
> > > >
> > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > >
> > > > > When Application A submits jobs (a1, a2, a3) and application B
> > > > > submits
> > > >
> > > > s/Application/application
> > > >
> > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > application A
> > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > > > force
> > > > > signal all jobs sequentially. However, due to missing
> > > > > work_run_job or
> > > > > work_free_job
> > > > >
> > > >
> > > > You probably mean "due to the work items work_run_job and
> > > > work_free_job
> > > > not being started […]".
> > > >
> > > > However, the call you add, drm_sched_wakeup(), immediately only
> > > > triggers work_run_job. It's not clear to me why you mention
> > > > work_free_job, because drm_sched_entity_kill_jobs_work() directly
> > > > invokes the free_job() callback.
> > >
> > > Yeah the description is rather confusing, took me more than one try
> > > to understand what's going on here as well. Let me try to explain it
> > > in my words:
> > >
> > > When an application A is killed the pending submissions from it are
> > > not executed, but rather torn down by
> > > drm_sched_entity_kill_jobs_work().
> > >
> > > If now a submission from application B depends on something
> > > application A wanted to do on the same scheduler instance the
> > > function drm_sched_entity_add_dependency_cb() would have optimized
> > > this dependency and assumed that the scheduler work would already run
> > > and try to pick a job.
> > >
> > > But that isn't the case when the submissions from application A are
> > > all killed. So make sure to start the scheduler work from
> > > drm_sched_entity_kill_jobs_work() to handle that case.
> >
> > Alright, so the bug then depends on B setting a dependency on A _after_
> > A was killed, doesn't it? Because the optimization in
> > _add_dependency_cb() can only have an effect after A's jobs have been
> > torn down.
>
> No, application A and application B submit right behind each other. Application A is then killed later on.
>
> The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
>
> > If there is such a timing order problem, the commit message should
> > mention it.
> >
> > The commit message definitely also needs to mention this optimization
> > in drm_sched_entity_add_dependency_cb() since it's essential for
> > understanding the bug.
>
> +1
>
> Christian.
>
> >
> >
> > Danke
> > P.
> >
> >
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > > in entity_kill_job_work(), the scheduler enters sleep
> > > > > state, causing application B hang.
> > > >
> > > > So the issue is that if a1 never ran, the scheduler will not
> > > > continue
> > > > with the jobs of application B, correct?
> > > >
> > > > >
> > > > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> > > >
> > > > s/to preventing/to prevent
> > > >
> > > > > scheduler sleep and subsequent application hangs.
> > > > >
> > > > > v2:
> > > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > > >
> > > > Changelogs are cool and useful, but move them below the Signed-off
> > > > area
> > > > with another --- please, like so:
> > > >
> > > > Signed-off by: …
> > > > ---
> > > > v2:
> > > > …
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > >
> > > >
> > > > >
> > > > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> > > >
> > > > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
> > > > and
> > > > put the stable kernel in CC.
> > > >
> > > > Please figure out with git blame, git log and git tag --contains
> > > > which
> > > > commit your patch fixes and which is the oldest kernel that
> > > > contains
> > > > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
> > > > find
> > > > lots of examples for that in git log.
> > > >
> > > >
> > > > Thx
> > > > P.
> > > >
> > > > > ---
> > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > 1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > @@ -177,6 +177,7 @@ static void
> > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > > > struct drm_sched_job *job = container_of(wrk,
> > > > > typeof(*job), work);
> > > > >
> > > > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > + drm_sched_wakeup(job->sched);
> > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > WARN_ON(job->s_fence->parent);
> > > > > job->sched->ops->free_job(job);
> > > >
> > >
> >
>
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 9:51 ` cao, lin
@ 2025-07-15 10:27 ` Philipp Stanner
2025-07-15 10:52 ` Christian König
From: Philipp Stanner @ 2025-07-15 10:27 UTC (permalink / raw)
To: cao, lin, Koenig, Christian, phasta@kernel.org,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
>
>
>
>
> Hi Philipp,
>
>
> Thanks for your review, let me try to clarify why I added drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
>
>
> When application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a1's scheduled fence, the normal execution flow is (function drm_sched_run_job_work()):
> 1. a1 gets popped from the entity by the scheduler
> 2. run_job(a1) executes
> 3. a1's scheduled fence gets signaled
> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
> 5. The scheduler wakes up and re-selects entities to pop jobs
> 6. Since b1's dependency is cleared, the scheduler can select b1 and continue the same flow
>
>
> However, if application A is killed before a1 gets popped by the scheduler, then a1, a2, a3 are killed sequentially by drm_sched_entity_kill_jobs_cb(). During the kill process, their scheduled fences are still signaled, but the kill process itself lacks work_run_job. This means b1's dependency gets cleared, but there's no work_run_job to drive the scheduler to continue running, or we can say step 4 in the normal flow is missing, and the scheduler enters sleep state, which causes application B to hang.
> So if we add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() after drm_sched_fence_scheduled(), the scheduler can be woken up, and application B can continue running after a1's scheduled fence is force signaled.
Thanks for the detailed explanation. Makes sense.
Maybe you can detail in v3 that this "optimization" consists of the
work item not being scheduled. I think that was the piece of the puzzle
I was missing.
I / DRM tools will also include a link to this thread, so I think that
will then be sufficient.
Thx
P.
>
> Thanks,
> Lin
>
>
>
>
>
> From: Philipp Stanner <phasta@mailbox.org>
> Sent: Tuesday, July 15, 2025 17:12
> To: cao, lin <lin.cao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; phasta@kernel.org <phasta@kernel.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>
>
> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> >
> >
> >
> >
> > Hi Christian, Philipp,
> >
> >
> > I modified the commit msg, and I hope it makes more sense:
> >
> >
> > drm/sched: wake up scheduler when killing jobs to prevent hang
>
> nit:
> s/wake/Wake
>
> >
> >
> > When application A submits jobs (a1, a2, a3) and application B submits
> > job b1 with a dependency on a2's scheduler fence, killing application A
> > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> > signal all jobs sequentially. However, the optimization in
> > drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
> > by adding a callback (drm_sched_entity_clear_dep) instead of immediately
> > waking up the scheduler. When A is killed before its jobs can run, the
> > callback gets triggered but drm_sched_entity_clear_dep() only clears the
> > dependency without waking up the scheduler, causing the scheduler to enter
> > sleep state and application B to hang.
>
> That now reads as if drm_sched_entity_clear_dep() is supposed to wake
> up the scheduler. But you add the wakeup to a different function (the
> work item).
>
> Also the code actually looks like that:
>
>
> 		xa_erase(&job->dependencies, index);
> 		if (f && !dma_fence_add_callback(f, &job->finish_cb,
> 						 drm_sched_entity_kill_jobs_cb))
> 			return;
>
> 		dma_fence_put(f);
> 	}
>
> 	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
> 	schedule_work(&job->work);
> }
>
> So if adding that callback succeeds we.. return immediately but we.. do
> that for the first dependency, not for all of them?
>
> Oh boy. That code base. We(tm) should look into pimping that up. Also
> lacks documentation everywhere.
>
>
> Anyways. If this solves a bug for you the patch looks fine (i.e., not
> potentially harmful) by me and if the tests succeed we can merge it.
> However, I'd feel better if you could clarify more why that function is
> the right place to solve the bug.
>
>
> Thanks,
> P.
>
>
> >
> >
> > Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
> > sleep and subsequent application hangs.
> >
> >
> > v2:
> > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> >
> >
> > Thanks,
> > Lin
> >
> >
> > From: Koenig, Christian <Christian.Koenig@amd.com>
> > Sent: Monday, July 14, 2025 21:39
> > To: phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
> > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
> >
> >
> >
> >
> > On 14.07.25 15:27, Philipp Stanner wrote:
> > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > >
> > > >
> > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > regarding the patch subject: the prefix we use for the scheduler
> > > > > is:
> > > > > drm/sched:
> > > > >
> > > > >
> > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > >
> > > > > > When Application A submits jobs (a1, a2, a3) and application B
> > > > > > submits
> > > > >
> > > > > s/Application/application
> > > > >
> > > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > > application A
> > > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > > > > force
> > > > > > signal all jobs sequentially. However, due to missing
> > > > > > work_run_job or
> > > > > > work_free_job
> > > > > >
> > > > >
> > > > > You probably mean "due to the work items work_run_job and
> > > > > work_free_job
> > > > > not being started […]".
> > > > >
> > > > > However, the call you add, drm_sched_wakeup(), immediately only
> > > > > triggers work_run_job. It's not clear to me why you mention
> > > > > work_free_job, because drm_sched_entity_kill_jobs_work() directly
> > > > > invokes the free_job() callback.
> > > >
> > > > Yeah the description is rather confusing, took me more than one try
> > > > to understand what's going on here as well. Let me try to explain it
> > > > in my words:
> > > >
> > > > When an application A is killed the pending submissions from it are
> > > > not executed, but rather torn down by
> > > > drm_sched_entity_kill_jobs_work().
> > > >
> > > > If now a submission from application B depends on something
> > > > application A wanted to do on the same scheduler instance the
> > > > function drm_sched_entity_add_dependency_cb() would have optimized
> > > > this dependency and assumed that the scheduler work would already run
> > > > and try to pick a job.
> > > >
> > > > But that isn't the case when the submissions from application A are
> > > > all killed. So make sure to start the scheduler work from
> > > > drm_sched_entity_kill_jobs_work() to handle that case.
> > >
> > > Alright, so the bug then depends on B setting a dependency on A _after_
> > > A was killed, doesn't it? Because the optimization in
> > > _add_dependency_cb() can only have an effect after A's jobs have been
> > > torn down.
> >
> > No, application A and application B submit right behind each other. Application A is then killed later on.
> >
> > The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
> >
> > > If there is such a timing order problem, the commit message should
> > > mention it.
> > >
> > > The commit message definitely also needs to mention this optimization
> > > in drm_sched_entity_add_dependency_cb() since it's essential for
> > > understanding the bug.
> >
> > +1
> >
> > Christian.
> >
> > >
> > >
> > > Danke
> > > P.
> > >
> > >
> > > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > >
> > > > > > in entity_kill_job_work(), the scheduler enters sleep
> > > > > > state, causing application B hang.
> > > > >
> > > > > So the issue is that if a1 never ran, the scheduler will not
> > > > > continue
> > > > > with the jobs of application B, correct?
> > > > >
> > > > > >
> > > > > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> > > > >
> > > > > s/to preventing/to prevent
> > > > >
> > > > > > scheduler sleep and subsequent application hangs.
> > > > > >
> > > > > > v2:
> > > > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > > > >
> > > > > Changelogs are cool and useful, but move them below the Signed-off
> > > > > area
> > > > > with another --- please, like so:
> > > > >
> > > > > Signed-off by: …
> > > > > ---
> > > > > v2:
> > > > > …
> > > > > ---
> > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> > > > >
> > > > > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
> > > > > and
> > > > > put the stable kernel in CC.
> > > > >
> > > > > Please figure out with git blame, git log and git tag --contains
> > > > > which
> > > > > commit your patch fixes and which is the oldest kernel that
> > > > > contains
> > > > > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
> > > > > find
> > > > > lots of examples for that in git log.
> > > > >
> > > > >
> > > > > Thx
> > > > > P.
> > > > >
> > > > > > ---
> > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > 1 file changed, 1 insertion(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > @@ -177,6 +177,7 @@ static void
> > > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > > > > struct drm_sched_job *job = container_of(wrk,
> > > > > > typeof(*job), work);
> > > > > >
> > > > > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > + drm_sched_wakeup(job->sched);
> > > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > > WARN_ON(job->s_fence->parent);
> > > > > > job->sched->ops->free_job(job);
> > > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 10:27 ` Philipp Stanner
@ 2025-07-15 10:52 ` Christian König
2025-07-15 12:20 ` Philipp Stanner
0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2025-07-15 10:52 UTC (permalink / raw)
To: phasta, cao, lin, dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
On 15.07.25 12:27, Philipp Stanner wrote:
> On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>
>>
>> Hi Philipp,
>>
>>
>> Thanks for your review, let me try to clarify why I added drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
>>
>>
>> When application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a1's scheduled fence, the normal execution flow is (function drm_sched_run_job_work()):
>> 1. a1 gets popped from the entity by the scheduler
>> 2. run_job(a1) executes
>> 3. a1's scheduled fence gets signaled
>> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
>> 5. The scheduler wakes up and re-selects entities to pop jobs
>> 6. Since b1's dependency is cleared, the scheduler can select b1 and continue the same flow
>>
>>
>> However, if application A is killed before a1 gets popped by the scheduler, then a1, a2, a3 are killed sequentially by drm_sched_entity_kill_jobs_cb(). During the kill process, their scheduled fences are still signaled, but the kill process itself lacks work_run_job. This means b1's dependency gets cleared, but there's no work_run_job to drive the scheduler to continue running, or we can say step 4 in the normal flow is missing, and the scheduler enters sleep state, which causes application B to hang.
>> So if we add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() after drm_sched_fence_scheduled(), the scheduler can be woken up, and application B can continue running after a1's scheduled fence is force signaled.
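For reference, the two dma_fence callbacks involved here look roughly like
this in sched_entity.c (paraphrased from the code under discussion; treat
it as a sketch, exact details may differ between kernel versions):

static void drm_sched_entity_clear_dep(struct dma_fence *f,
                                       struct dma_fence_cb *cb)
{
        struct drm_sched_entity *entity =
                container_of(cb, struct drm_sched_entity, cb);

        /* Drops the dependency, but nothing re-queues the scheduler work. */
        entity->dependency = NULL;
        dma_fence_put(f);
}

static void drm_sched_entity_wakeup(struct dma_fence *f,
                                    struct dma_fence_cb *cb)
{
        struct drm_sched_entity *entity =
                container_of(cb, struct drm_sched_entity, cb);

        drm_sched_entity_clear_dep(f, cb);
        /* The generic cross-scheduler path additionally kicks the scheduler. */
        drm_sched_wakeup(entity->rq->sched);
}

The same-scheduler fast path installs the first callback. Normally that is
fine because drm_sched_run_job_work() re-queues itself after signaling the
scheduled fence; the kill path has no such re-queue, which is exactly the
hole the patch plugs.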
>
> Thanks for the detailed explanation. Makes sense.
> Maybe you can detail in v3 that this "optimization" consists of the
> work item not being scheduled.
Yeah, removing this "optimization" would also be a valid solution to the problem.
Christian.
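
Concretely, the fast path in question is the branch of
drm_sched_entity_add_dependency_cb() that, for a fence from the same
scheduler, waits only on the scheduled fence and installs
drm_sched_entity_clear_dep() instead of drm_sched_entity_wakeup().
Roughly, with the locals (fence, s_fence, sched, entity) coming from the
enclosing function (paraphrased, untested):

        s_fence = to_drm_sched_fence(fence);
        if (!fence->error && s_fence && s_fence->sched == sched &&
            !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags)) {
                /*
                 * Fence is from the same scheduler: only wait for the job
                 * to be scheduled, don't wake the scheduler up again.
                 */
                fence = dma_fence_get(&s_fence->scheduled);
                dma_fence_put(entity->dependency);
                entity->dependency = fence;
                if (!dma_fence_add_callback(fence, &entity->cb,
                                            drm_sched_entity_clear_dep))
                        return true;

                /* Already scheduled, nothing to wait for. */
                dma_fence_put(fence);
                return false;
        }

Deleting this block, so that every dependency falls through to the
drm_sched_entity_wakeup() callback, would fix the hang without touching
the kill path at all.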
> I think that was the piece of the puzzle
> I was missing.
>
> I / DRM tools will also include a link to this thread, so I think that
> will then be sufficient.
>
> Thx
> P.
>
>>
>> Thanks,
>> Lin
>>
>>
>>
>>
>>
>> From: Philipp Stanner <phasta@mailbox.org>
>> Sent: Tuesday, July 15, 2025 17:12
>> To: cao, lin <lin.cao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; phasta@kernel.org <phasta@kernel.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>>
>>
>> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>
>>>
>>> Hi Christian, Philipp,
>>>
>>>
>>> I modified the commit msg, and I hope it makes more sense:
>>>
>>>
>>> drm/sched: wake up scheduler when killing jobs to prevent hang
>>
>> nit:
>> s/wake/Wake
>>
>>>
>>>
>>> When application A submits jobs (a1, a2, a3) and application B submits
>>> job b1 with a dependency on a2's scheduler fence, killing application A
>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
>>> signal all jobs sequentially. However, the optimization in
>>> drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
>>> by adding a callback (drm_sched_entity_clear_dep) instead of immediately
>>> waking up the scheduler. When A is killed before its jobs can run, the
>>> callback gets triggered but drm_sched_entity_clear_dep() only clears the
>>> dependency without waking up the scheduler, causing the scheduler to enter
>>> sleep state and application B to hang.
>>
>> That now reads as if drm_sched_entity_clear_dep() is supposed to wake
>> up the scheduler. But you add the wakeup to a different function (the
>> work item).
>>
>> Also the code actually looks like that:
>>
>>
>> xa_erase(&job->dependencies, index);
>> if (f && !dma_fence_add_callback(f, &job->finish_cb,
>> drm_sched_entity_kill_jobs_cb))
>> return;
>>
>> dma_fence_put(f);
>> }
>>
>> INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>> schedule_work(&job->work);
>> }
>>
>> So if adding that callback succeeds we.. return immediately but we.. do
>> that for the first dependency, not for all of them?
>>
>> Oh boy. That code base. We(tm) should look into pimping that up. Also
>> lacks documentation everywhere.
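
As far as one can tell from the surrounding code, the walk is chained
rather than broken: the callback being installed is
drm_sched_entity_kill_jobs_cb itself, so when the pending dependency
signals, the function re-enters and resumes the xa_for_each() walk (the
entries handled so far were already erased). Dependencies are thus
consumed one signal at a time, and only once the xarray is empty does the
kill work get scheduled. A simplified sketch with illustrative comments
(the real function additionally swaps a scheduled fence for its finished
fence):

static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
                                          struct dma_fence_cb *cb)
{
        struct drm_sched_job *job = container_of(cb, typeof(*job), finish_cb);
        unsigned long index;

        dma_fence_put(f);

        /* Walk whatever dependencies are left, erasing entries as we go. */
        xa_for_each(&job->dependencies, index, f) {
                xa_erase(&job->dependencies, index);
                /* Re-arm this very function on the next unsignaled fence. */
                if (f && !dma_fence_add_callback(f, &job->finish_cb,
                                                 drm_sched_entity_kill_jobs_cb))
                        return; /* resumes here once the fence signals */

                dma_fence_put(f);
        }

        /* Every dependency has signaled: now tear the job down. */
        INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
        schedule_work(&job->work);
}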
>>
>>
>> Anyways. If this solves a bug for you the patch looks fine (i.e., not
>> potentially harmful) by me and if the tests succeed we can merge it.
>> However, I'd feel better if you could clarify more why that function is
>> the right place to solve the bug.
>>
>>
>> Thanks,
>> P.
>>
>>
>>>
>>>
>>> Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
>>> sleep and subsequent application hangs.
>>>
>>>
>>> v2:
>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>
>>>
>>> Thanks,
>>> Lin
>>>
>>>
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Monday, July 14, 2025 21:39
>>> To: phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>>>
>>>
>>>
>>>
>>> On 14.07.25 15:27, Philipp Stanner wrote:
>>>> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>>>>
>>>>>
>>>>> On 14.07.25 14:46, Philipp Stanner wrote:
>>>>>> regarding the patch subject: the prefix we use for the scheduler
>>>>>> is:
>>>>>> drm/sched:
>>>>>>
>>>>>>
>>>>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>>>>
>>>>>>> When Application A submits jobs (a1, a2, a3) and application B
>>>>>>> submits
>>>>>>
>>>>>> s/Application/application
>>>>>>
>>>>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>>>>> application A
>>>>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
>>>>>>> force
>>>>>>> signal all jobs sequentially. However, due to missing
>>>>>>> work_run_job or
>>>>>>> work_free_job
>>>>>>>
>>>>>>
>>>>>> You probably mean "due to the work items work_run_job and
>>>>>> work_free_job
>>>>>> not being started […]".
>>>>>>
>>>>>> However, the call you add, drm_sched_wakeup(), immediately only
>>>>>> triggers work_run_job. It's not clear to me why you mention
>>>>>> work_free_job, because drm_sched_entity_kill_jobs_work() directly
>>>>>> invokes the free_job() callback.
>>>>>
>>>>> Yeah the description is rather confusing, took me more than one try
>>>>> to understand what's going on here as well. Let me try to explain it
>>>>> in my words:
>>>>>
>>>>> When an application A is killed the pending submissions from it are
>>>>> not executed, but rather torn down by
>>>>> drm_sched_entity_kill_jobs_work().
>>>>>
>>>>> If now a submission from application B depends on something
>>>>> application A wanted to do on the same scheduler instance the
>>>>> function drm_sched_entity_add_dependency_cb() would have optimized
>>>>> this dependency and assumed that the scheduler work would already run
>>>>> and try to pick a job.
>>>>>
>>>>> But that isn't the case when the submissions from application A are
>>>>> all killed. So make sure to start the scheduler work from
>>>>> drm_sched_entity_kill_jobs_work() to handle that case.
>>>>
>>>> Alright, so the bug then depends on B setting a dependency on A _after_
>>>> A was killed, doesn't it? Because the optimization in
>>>> _add_dependency_cb() can only have an effect after A's jobs have been
>>>> torn down.
>>>
>>> No, application A and application B submit right behind each other. Application A is then killed later on.
>>>
>>> The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
>>>
>>>> If there is such a timing order problem, the commit message should
>>>> mention it.
>>>>
>>>> The commit message definitely also needs to mention this optimization
>>>> in drm_sched_entity_add_dependency_cb() since it's essential for
>>>> understanding the bug.
>>>
>>> +1
>>>
>>> Christian.
>>>
>>>>
>>>>
>>>> Danke
>>>> P.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>>>>> state, causing application B hang.
>>>>>>
>>>>>> So the issue is that if a1 never ran, the scheduler will not
>>>>>> continue
>>>>>> with the jobs of application B, correct?
>>>>>>
>>>>>>>
>>>>>>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>>>>>>
>>>>>> s/to preventing/to prevent
>>>>>>
>>>>>>> scheduler sleep and subsequent application hangs.
>>>>>>>
>>>>>>> v2:
>>>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>>
>>>>>> Changelogs are cool and useful, but move them below the Signed-off
>>>>>> area
>>>>>> with another --- please, like so:
>>>>>>
>>>>>> Signed-off by: …
>>>>>> ---
>>>>>> v2:
>>>>>> …
>>>>>> ---
>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Lin.Cao <lincao12@amd.com>
>>>>>>
>>>>>> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
>>>>>> and
>>>>>> put the stable kernel in CC.
>>>>>>
>>>>>> Please figure out with git blame, git log and git tag --contains
>>>>>> which
>>>>>> commit your patch fixes and which is the oldest kernel that
>>>>>> contains
>>>>>> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
>>>>>> find
>>>>>> lots of examples for that in git log.
>>>>>>
>>>>>>
>>>>>> Thx
>>>>>> P.
>>>>>>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>> 1 file changed, 1 insertion(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> index e671aa241720..66f2a43c58fd 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> @@ -177,6 +177,7 @@ static void
>>>>>>> drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>>>>> struct drm_sched_job *job = container_of(wrk,
>>>>>>> typeof(*job), work);
>>>>>>>
>>>>>>> drm_sched_fence_scheduled(job->s_fence, NULL);
>>>>>>> + drm_sched_wakeup(job->sched);
>>>>>>> drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>>>>> WARN_ON(job->s_fence->parent);
>>>>>>> job->sched->ops->free_job(job);
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 10:52 ` Christian König
@ 2025-07-15 12:20 ` Philipp Stanner
2025-07-15 12:32 ` Christian König
0 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-15 12:20 UTC (permalink / raw)
To: Christian König, phasta, cao, lin,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
> On 15.07.25 12:27, Philipp Stanner wrote:
> > On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
> > >
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > >
> > >
> > > Hi Philipp,
> > >
> > >
> > > Thanks for your review, let me try to clarify why I added
> > > drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
> > >
> > >
> > > When application A submits jobs (a1, a2, a3) and application B
> > > submits job b1 with a dependency on a1's scheduled fence, the
> > > normal execution flow is (function drm_sched_run_job_work()):
> > > 1. a1 gets popped from the entity by the scheduler
> > > 2. run_job(a1) executes
> > > 3. a1's scheduled fence gets signaled
> > > 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at
> > > the end
> > > 5. The scheduler wakes up and re-selects entities to pop jobs
> > > 6. Since b1's dependency is cleared, the scheduler can select b1
> > > and continue the same flow
> > >
> > >
> > > However, if application A is killed before a1 gets popped by the
> > > scheduler, then a1, a2, a3 are killed sequentially by
> > > drm_sched_entity_kill_jobs_cb(). During the kill process, their
> > > scheduled fences are still signaled, but the kill process itself
> > > lacks work_run_job. This means b1's dependency gets cleared, but
> > > there's no work_run_job to drive the scheduler to continue
> > > running, or we can say step 4 in the normal flow is missing, and
> > > the scheduler enters sleep state, which causes application B to
> > > hang.
> > > So if we add drm_sched_wakeup() in
> > > drm_sched_entity_kill_jobs_work() after
> > > drm_sched_fence_scheduled(), the scheduler can be woken up, and
> > > application B can continue running after a1's scheduled fence is
> > > force signaled.
> >
> > Thanks for the detailed explanation. Makes sense.
> > Maybe you can detail in v3 that this "optimization" consists of the
> > work item not being scheduled.
>
> Yeah, removing this "optimization" would also be a valid solution to
> the problem.
If you at AMD are willing to work on that, that'd definitely be fine.
Removing code is usually better than adding code.
Do you think being afraid of a performance regression is realistic
here, though?
P.
>
> Christian.
>
> > I think that was the piece of the puzzle
> > I was missing.
> >
> > I / DRM tools will also include a link to this thread, so I think
> > that
> > will then be sufficient.
> >
> > Thx
> > P.
> >
> > >
> > > Thanks,
> > > Lin
> > >
> > >
> > >
> > >
> > >
> > > From: Philipp Stanner <phasta@mailbox.org>
> > > Sent: Tuesday, July 15, 2025 17:12
> > > To: cao, lin <lin.cao@amd.com>; Koenig, Christian
> > > <Christian.Koenig@amd.com>;
> > > phasta@kernel.org <phasta@kernel.org>;
> > > dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
> > > <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
> > > matthew.brost@intel.com <matthew.brost@intel.com>
> > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > killing app with dependent jobs
> > >
> > >
> > > On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> > > >
> > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > >
> > > >
> > > >
> > > > Hi Christian, Philipp,
> > > >
> > > >
> > > > I modified the commit msg, and I hope it makes more sense:
> > > >
> > > >
> > > > drm/sched: wake up scheduler when killing jobs to prevent hang
> > >
> > > nit:
> > > s/wake/Wake
> > >
> > > >
> > > >
> > > > When application A submits jobs (a1, a2, a3) and application B
> > > > submits
> > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > application A
> > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > > force
> > > > signal all jobs sequentially. However, the optimization in
> > > > drm_sched_entity_add_dependency_cb() waits for the fence to be
> > > > scheduled
> > > > by adding a callback (drm_sched_entity_clear_dep) instead of
> > > > immediately
> > > > waking up the scheduler. When A is killed before its jobs can
> > > > run, the
> > > > callback gets triggered but drm_sched_entity_clear_dep() only
> > > > clears the
> > > > dependency without waking up the scheduler, causing the
> > > > scheduler to enter
> > > > sleep state and application B to hang.
> > >
> > > That now reads as if drm_sched_entity_clear_dep() is supposed to
> > > wake
> > > up the scheduler. But you add the wakeup to a different function
> > > (the
> > > work item).
> > >
> > > Also the code actually looks like that:
> > >
> > >
> > > xa_erase(&job->dependencies, index);
> > > if (f && !dma_fence_add_callback(f, &job->finish_cb,
> > > drm_sched_entity_kill_jobs_cb))
> > > return;
> > >
> > > dma_fence_put(f);
> > > }
> > >
> > > INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
> > > schedule_work(&job->work);
> > > }
> > >
> > > So if adding that callback succeeds we.. return immediately but
> > > we.. do
> > > that for the first dependency, not for all of them?
> > >
> > > Oh boy. That code base. We(tm) should look into pimping that up.
> > > Also
> > > lacks documentation everywhere.
> > >
> > >
> > > Anyways. If this solves a bug for you the patch looks fine (i.e.,
> > > not
> > > potentially harmful) by me and if the tests succeed we can merge
> > > it.
> > > However, I'd feel better if you could clarify more why that
> > > function is
> > > the right place to solve the bug.
> > >
> > >
> > > Thanks,
> > > P.
> > >
> > >
> > > >
> > > >
> > > > Add drm_sched_wakeup() in entity_kill_job_work() to prevent
> > > > scheduler
> > > > sleep and subsequent application hangs.
> > > >
> > > >
> > > > v2:
> > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > > >
> > > >
> > > > Thanks,
> > > > Lin
> > > >
> > > >
> > > > From: Koenig, Christian <Christian.Koenig@amd.com>
> > > > Sent: Monday, July 14, 2025 21:39
> > > > To: phasta@kernel.org <phasta@kernel.org>; cao, lin
> > > > <lin.cao@amd.com>;
> > > > dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
> > > > <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
> > > > matthew.brost@intel.com <matthew.brost@intel.com>
> > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > killing app with dependent jobs
> > > >
> > > >
> > > >
> > > >
> > > > On 14.07.25 15:27, Philipp Stanner wrote:
> > > > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > > > >
> > > > > >
> > > > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > > > regarding the patch subject: the prefix we use for the
> > > > > > > scheduler
> > > > > > > is:
> > > > > > > drm/sched:
> > > > > > >
> > > > > > >
> > > > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > > > >
> > > > > > > > When Application A submits jobs (a1, a2, a3) and
> > > > > > > > application B
> > > > > > > > submits
> > > > > > >
> > > > > > > s/Application/application
> > > > > > >
> > > > > > > > job b1 with a dependency on a2's scheduler fence,
> > > > > > > > killing
> > > > > > > > application A
> > > > > > > > before run_job(a1) causes
> > > > > > > > drm_sched_entity_kill_jobs_work() to
> > > > > > > > force
> > > > > > > > signal all jobs sequentially. However, due to missing
> > > > > > > > work_run_job or
> > > > > > > > work_free_job
> > > > > > > >
> > > > > > >
> > > > > > > You probably mean "due to the work items work_run_job and
> > > > > > > work_free_job
> > > > > > > not being started […]".
> > > > > > >
> > > > > > > However, the call you add, drm_sched_wakeup(),
> > > > > > > immediately only
> > > > > > > triggers work_run_job. It's not clear to me why you
> > > > > > > mention
> > > > > > > work_free_job, because drm_sched_entity_kill_jobs_work()
> > > > > > > directly
> > > > > > > invokes the free_job() callback.
> > > > > >
> > > > > > Yeah the description is rather confusing, took me more than
> > > > > > one try
> > > > > > to understand what's going on here as well. Let me try to
> > > > > > explain it
> > > > > > in my words:
> > > > > >
> > > > > > When an application A is killed the pending submissions
> > > > > > from it are
> > > > > > not executed, but rather torn down by
> > > > > > drm_sched_entity_kill_jobs_work().
> > > > > >
> > > > > > If now a submission from application B depends on something
> > > > > > application A wanted to do on the same scheduler instance
> > > > > > the
> > > > > > function drm_sched_entity_add_dependency_cb() would have
> > > > > > optimized
> > > > > > this dependency and assumed that the scheduler work would
> > > > > > already run
> > > > > > and try to pick a job.
> > > > > >
> > > > > > But that isn't the case when the submissions from
> > > > > > application A are
> > > > > > all killed. So make sure to start the scheduler work from
> > > > > > drm_sched_entity_kill_jobs_work() to handle that case.
> > > > >
> > > > > Alright, so the bug then depends on B setting a dependency on
> > > > > A _after_
> > > > > A was killed, doesn't it? Because the optimization in
> > > > > _add_dependency_cb() can only have an effect after A's jobs
> > > > > have been
> > > > > torn down.
> > > >
> > > > No, application A and application B submit right behind each
> > > > other. Application A is then killed later on.
> > > >
> > > > The optimization in _add_dependency_cb() just waits for A's
> > > > submission to be handled by the scheduler, but that never
> > > > happens because it was killed.
> > > >
> > > > > If there is such a timing order problem, the commit message
> > > > > should
> > > > > mention it.
> > > > >
> > > > > The commit message definitely also needs to mention this
> > > > > optimization
> > > > > in drm_sched_entity_add_dependency_cb() since it's essential
> > > > > for
> > > > > understanding the bug.
> > > >
> > > > +1
> > > >
> > > > Christian.
> > > >
> > > > >
> > > > >
> > > > > Danke
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > > > in entity_kill_job_work(), the scheduler enters sleep
> > > > > > > > state, causing application B hang.
> > > > > > >
> > > > > > > So the issue is that if a1 never ran, the scheduler will
> > > > > > > not
> > > > > > > continue
> > > > > > > with the jobs of application B, correct?
> > > > > > >
> > > > > > > >
> > > > > > > > Add drm_sched_wakeup() when entity_kill_job_work() to
> > > > > > > > preventing
> > > > > > >
> > > > > > > s/to preventing/to prevent
> > > > > > >
> > > > > > > > scheduler sleep and subsequent application hangs.
> > > > > > > >
> > > > > > > > v2:
> > > > > > > > - Move drm_sched_wakeup() to after
> > > > > > > > drm_sched_fence_scheduled()
> > > > > > >
> > > > > > > Changelogs are cool and useful, but move them below the
> > > > > > > Signed-off
> > > > > > > area
> > > > > > > with another --- please, like so:
> > > > > > >
> > > > > > > Signed-off by: …
> > > > > > > ---
> > > > > > > v2:
> > > > > > > …
> > > > > > > ---
> > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> > > > > > >
> > > > > > > This fixes a bug. Even a racy bug, it seems. So we need a
> > > > > > > Fixes tag
> > > > > > > and
> > > > > > > put the stable kernel in CC.
> > > > > > >
> > > > > > > Please figure out with git blame, git log and git tag --
> > > > > > > contains
> > > > > > > which
> > > > > > > commit your patch fixes and which is the oldest kernel
> > > > > > > that
> > > > > > > contains
> > > > > > > the bug. Then add a Fixes: tag and Cc: the stable kernel.
> > > > > > > You'll
> > > > > > > find
> > > > > > > lots of examples for that in git log.
> > > > > > >
> > > > > > >
> > > > > > > Thx
> > > > > > > P.
> > > > > > >
> > > > > > > > ---
> > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > 1 file changed, 1 insertion(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > @@ -177,6 +177,7 @@ static void
> > > > > > > > drm_sched_entity_kill_jobs_work(struct work_struct
> > > > > > > > *wrk)
> > > > > > > > struct drm_sched_job *job = container_of(wrk,
> > > > > > > > typeof(*job), work);
> > > > > > > >
> > > > > > > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > > > + drm_sched_wakeup(job->sched);
> > > > > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > > > > WARN_ON(job->s_fence->parent);
> > > > > > > > job->sched->ops->free_job(job);
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 12:20 ` Philipp Stanner
@ 2025-07-15 12:32 ` Christian König
2025-07-15 13:07 ` Philipp Stanner
0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2025-07-15 12:32 UTC (permalink / raw)
To: phasta, cao, lin, dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
On 15.07.25 14:20, Philipp Stanner wrote:
> On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
>> On 15.07.25 12:27, Philipp Stanner wrote:
>>> On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
>>>>
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>>
>>>>
>>>> Hi Philipp,
>>>>
>>>>
>>>> Thanks for your review, let me try to clarify why I added
>>>> drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
>>>>
>>>>
>>>> When application A submits jobs (a1, a2, a3) and application B
>>>> submits job b1 with a dependency on a1's scheduled fence, the
>>>> normal execution flow is (function drm_sched_run_job_work()):
>>>> 1. a1 gets popped from the entity by the scheduler
>>>> 2. run_job(a1) executes
>>>> 3. a1's scheduled fence gets signaled
>>>> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at
>>>> the end
>>>> 5. The scheduler wakes up and re-selects entities to pop jobs
>>>> 6. Since b1's dependency is cleared, the scheduler can select b1
>>>> and continue the same flow
>>>>
>>>>
>>>> However, if application A is killed before a1 gets popped by the
>>>> scheduler, then a1, a2, a3 are killed sequentially by
>>>> drm_sched_entity_kill_jobs_cb(). During the kill process, their
>>>> scheduled fences are still signaled, but the kill process itself
>>>> lacks work_run_job. This means b1's dependency gets cleared, but
>>>> there's no work_run_job to drive the scheduler to continue
>>>> running, or we can say step 4 in the normal flow is missing, and
>>>> the scheduler enters sleep state, which causes application B to
>>>> hang.
>>>> So if we add drm_sched_wakeup() in
>>>> drm_sched_entity_kill_jobs_work() after
>>>> drm_sched_fence_scheduled(), the scheduler can be woken up, and
>>>> application B can continue running after a1's scheduled fence is
>>>> force signaled.
>>>
>>> Thanks for the detailed explanation. Makes sense.
>>> Maybe you can detail in v3 that this "optimization" consists of the
>>> work item not being scheduled.
>>
>> Yeah, removing this "optimization" would also be a valid solution to
>> the problem.
>
> If you at AMD are willing to work on that, that'd definitely be fine.
> Removing code is usually better than adding code.
I fear I won't have time for that before my retirement :)
> Do you think being afraid of a performance regression is realistic
> here, though?
No, absolutely not. It was just a micro optimization done long ago.
On the other hand with the scheduler code base I'm not sure of anything any more.
Regards,
Christian.
>
>
> P.
>
>>
>> Christian.
>>
>>> I think that was the piece of the puzzle
>>> I was missing.
>>>
>>> I / DRM tools will also include a link to this thread, so I think
>>> that
>>> will then be sufficient.
>>>
>>> Thx
>>> P.
>>>
>>>>
>>>> Thanks,
>>>> Lin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> From: Philipp Stanner <phasta@mailbox.org>
>>>> Sent: Tuesday, July 15, 2025 17:12
>>>> To: cao, lin <lin.cao@amd.com>; Koenig, Christian
>>>> <Christian.Koenig@amd.com>;
>>>> phasta@kernel.org <phasta@kernel.org>;
>>>> dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
>>>> <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
>>>> matthew.brost@intel.com <matthew.brost@intel.com>
>>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
>>>> killing app with dependent jobs
>>>>
>>>>
>>>> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>>>>>
>>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>>
>>>>>
>>>>>
>>>>> Hi Christian, Philipp,
>>>>>
>>>>>
>>>>> I modified the commit msg, and I hope it makes more sense:
>>>>>
>>>>>
>>>>> drm/sched: wake up scheduler when killing jobs to prevent hang
>>>>
>>>> nit:
>>>> s/wake/Wake
>>>>
>>>>>
>>>>>
>>>>> When application A submits jobs (a1, a2, a3) and application B
>>>>> submits
>>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>>> application A
>>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
>>>>> force
>>>>> signal all jobs sequentially. However, the optimization in
>>>>> drm_sched_entity_add_dependency_cb() waits for the fence to be
>>>>> scheduled
>>>>> by adding a callback (drm_sched_entity_clear_dep) instead of
>>>>> immediately
>>>>> waking up the scheduler. When A is killed before its jobs can
>>>>> run, the
>>>>> callback gets triggered but drm_sched_entity_clear_dep() only
>>>>> clears the
>>>>> dependency without waking up the scheduler, causing the
>>>>> scheduler to enter
>>>>> sleep state and application B to hang.
>>>>
>>>> That now reads as if drm_sched_entity_clear_dep() is supposed to
>>>> wake
>>>> up the scheduler. But you add the wakeup to a different function
>>>> (the
>>>> work item).
>>>>
>>>> Also the code actually looks like that:
>>>>
>>>>
>>>> xa_erase(&job->dependencies, index);
>>>> if (f && !dma_fence_add_callback(f, &job->finish_cb,
>>>> drm_sched_entity_kill_jobs_cb))
>>>> return;
>>>>
>>>> dma_fence_put(f);
>>>> }
>>>>
>>>> INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>>>> schedule_work(&job->work);
>>>> }
>>>>
>>>> So if adding that callback succeeds we.. return immediately but
>>>> we.. do
>>>> that for the first dependency, not for all of them?
>>>>
>>>> Oh boy. That code base. We(tm) should look into pimping that up.
>>>> Also
>>>> lacks documentation everywhere.
>>>>
>>>>
>>>> Anyways. If this solves a bug for you the patch looks fine (i.e.,
>>>> not
>>>> potentially harmful) by me and if the tests succeed we can merge
>>>> it.
>>>> However, I'd feel better if you could clarify more why that
>>>> function is
>>>> the right place to solve the bug.
>>>>
>>>>
>>>> Thanks,
>>>> P.
>>>>
>>>>
>>>>>
>>>>>
>>>>> Add drm_sched_wakeup() in entity_kill_job_work() to prevent
>>>>> scheduler
>>>>> sleep and subsequent application hangs.
>>>>>
>>>>>
>>>>> v2:
>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Lin
>>>>>
>>>>>
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Monday, July 14, 2025 21:39
>>>>> To: phasta@kernel.org <phasta@kernel.org>; cao, lin
>>>>> <lin.cao@amd.com>;
>>>>> dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>>>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
>>>>> <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
>>>>> matthew.brost@intel.com <matthew.brost@intel.com>
>>>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
>>>>> killing app with dependent jobs
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 14.07.25 15:27, Philipp Stanner wrote:
>>>>>> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 14.07.25 14:46, Philipp Stanner wrote:
>>>>>>>> regarding the patch subject: the prefix we use for the
>>>>>>>> scheduler
>>>>>>>> is:
>>>>>>>> drm/sched:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>>>>>>
>>>>>>>>> When Application A submits jobs (a1, a2, a3) and
>>>>>>>>> application B
>>>>>>>>> submits
>>>>>>>>
>>>>>>>> s/Application/application
>>>>>>>>
>>>>>>>>> job b1 with a dependency on a2's scheduler fence,
>>>>>>>>> killing
>>>>>>>>> application A
>>>>>>>>> before run_job(a1) causes
>>>>>>>>> drm_sched_entity_kill_jobs_work() to
>>>>>>>>> force
>>>>>>>>> signal all jobs sequentially. However, due to missing
>>>>>>>>> work_run_job or
>>>>>>>>> work_free_job
>>>>>>>>>
>>>>>>>>
>>>>>>>> You probably mean "due to the work items work_run_job and
>>>>>>>> work_free_job
>>>>>>>> not being started […]".
>>>>>>>>
>>>>>>>> However, the call you add, drm_sched_wakeup(),
>>>>>>>> immediately only
>>>>>>>> triggers work_run_job. It's not clear to me why you
>>>>>>>> mention
>>>>>>>> work_free_job, because drm_sched_entity_kill_jobs_work()
>>>>>>>> directly
>>>>>>>> invokes the free_job() callback.
>>>>>>>
>>>>>>> Yeah the description is rather confusing, took me more than
>>>>>>> one try
>>>>>>> to understand what's going on here as well. Let me try to
>>>>>>> explain it
>>>>>>> in my words:
>>>>>>>
>>>>>>> When an application A is killed the pending submissions
>>>>>>> from it are
>>>>>>> not executed, but rather torn down by
>>>>>>> drm_sched_entity_kill_jobs_work().
>>>>>>>
>>>>>>> If now a submission from application B depends on something
>>>>>>> application A wanted to do on the same scheduler instance
>>>>>>> the
>>>>>>> function drm_sched_entity_add_dependency_cb() would have
>>>>>>> optimized
>>>>>>> this dependency and assumed that the scheduler work would
>>>>>>> already run
>>>>>>> and try to pick a job.
>>>>>>>
>>>>>>> But that isn't the case when the submissions from
>>>>>>> application A are
>>>>>>> all killed. So make sure to start the scheduler work from
>>>>>>> drm_sched_entity_kill_jobs_work() to handle that case.
>>>>>>
>>>>>> Alright, so the bug then depends on B setting a dependency on
>>>>>> A _after_
>>>>>> A was killed, doesn't it? Because the optimization in
>>>>>> _add_dependency_cb() can only have an effect after A's jobs
>>>>>> have been
>>>>>> torn down.
>>>>>
>>>>> No, application A and application B submit right behind each
>>>>> other. Application A is then killed later on.
>>>>>
>>>>> The optimization in _add_dependency_cb() just waits for A's
>>>>> submission to be handled by the scheduler, but that never
>>>>> happens because it was killed.
>>>>>
>>>>>> If there is such a timing order problem, the commit message
>>>>>> should
>>>>>> mention it.
>>>>>>
>>>>>> The commit message definitely also needs to mention this
>>>>>> optimization
>>>>>> in drm_sched_entity_add_dependency_cb() since it's essential
>>>>>> for
>>>>>> understanding the bug.
>>>>>
>>>>> +1
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>
>>>>>> Danke
>>>>>> P.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>>>>>>> state, causing application B hang.
>>>>>>>>
>>>>>>>> So the issue is that if a1 never ran, the scheduler will
>>>>>>>> not
>>>>>>>> continue
>>>>>>>> with the jobs of application B, correct?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Add drm_sched_wakeup() when entity_kill_job_work() to
>>>>>>>>> preventing
>>>>>>>>
>>>>>>>> s/to preventing/to prevent
>>>>>>>>
>>>>>>>>> scheduler sleep and subsequent application hangs.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> - Move drm_sched_wakeup() to after
>>>>>>>>> drm_sched_fence_scheduled()
>>>>>>>>
>>>>>>>> Changelogs are cool and useful, but move them below the
>>>>>>>> Signed-off
>>>>>>>> area
>>>>>>>> with another --- please, like so:
>>>>>>>>
>>>>>>>> Signed-off by: …
>>>>>>>> ---
>>>>>>>> v2:
>>>>>>>> …
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Lin.Cao <lincao12@amd.com>
>>>>>>>>
>>>>>>>> This fixes a bug. Even a racy bug, it seems. So we need a
>>>>>>>> Fixes tag
>>>>>>>> and
>>>>>>>> put the stable kernel in CC.
>>>>>>>>
>>>>>>>> Please figure out with git blame, git log and git tag --
>>>>>>>> contains
>>>>>>>> which
>>>>>>>> commit your patch fixes and which is the oldest kernel
>>>>>>>> that
>>>>>>>> contains
>>>>>>>> the bug. Then add a Fixes: tag and Cc: the stable kernel.
>>>>>>>> You'll
>>>>>>>> find
>>>>>>>> lots of examples for that in git log.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thx
>>>>>>>> P.
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>>>> 1 file changed, 1 insertion(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> index e671aa241720..66f2a43c58fd 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> @@ -177,6 +177,7 @@ static void
>>>>>>>>> drm_sched_entity_kill_jobs_work(struct work_struct
>>>>>>>>> *wrk)
>>>>>>>>> struct drm_sched_job *job = container_of(wrk,
>>>>>>>>> typeof(*job), work);
>>>>>>>>>
>>>>>>>>> drm_sched_fence_scheduled(job->s_fence, NULL);
>>>>>>>>> + drm_sched_wakeup(job->sched);
>>>>>>>>> drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>>>>>>> WARN_ON(job->s_fence->parent);
>>>>>>>>> job->sched->ops->free_job(job);
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 12:32 ` Christian König
@ 2025-07-15 13:07 ` Philipp Stanner
2025-07-15 13:20 ` cao, lin
0 siblings, 1 reply; 15+ messages in thread
From: Philipp Stanner @ 2025-07-15 13:07 UTC (permalink / raw)
To: Christian König, phasta, cao, lin,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
On Tue, 2025-07-15 at 14:32 +0200, Christian König wrote:
> On 15.07.25 14:20, Philipp Stanner wrote:
> > On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
> > > On 15.07.25 12:27, Philipp Stanner wrote:
> > > > On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
> > > > >
> > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > >
> > > > >
> > > > >
> > > > > Hi Philipp,
> > > > >
> > > > >
> > > > > Thanks for your review, let me try to clarify why I added
> > > > > drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
> > > > >
> > > > >
> > > > > When application A submits jobs (a1, a2, a3) and application
> > > > > B
> > > > > submits job b1 with a dependency on a1's scheduled fence, the
> > > > > normal execution flow is (function drm_sched_run_job_work()):
> > > > > 1. a1 gets popped from the entity by the scheduler
> > > > > 2. run_job(a1) executes
> > > > > 3. a1's scheduled fence gets signaled
> > > > > 4. drm_sched_run_job_work() calls drm_sched_run_job_queue()
> > > > > at
> > > > > the end
> > > > > 5. The scheduler wakes up and re-selects entities to pop jobs
> > > > > 6. Since b1's dependency is cleared, the scheduler can select
> > > > > b1
> > > > > and continue the same flow
> > > > >
> > > > >
> > > > > However, if application A is killed before a1 gets popped by
> > > > > the
> > > > > scheduler, then a1, a2, a3 are killed sequentially by
> > > > > drm_sched_entity_kill_jobs_cb(). During the kill process,
> > > > > their
> > > > > scheduled fences are still signaled, but the kill process
> > > > > itself
> > > > > lacks work_run_job. This means b1's dependency gets cleared,
> > > > > but
> > > > > there's no work_run_job to drive the scheduler to continue
> > > > > running, or we can say step 4 in the normal flow is missing,
> > > > > and
> > > > > the scheduler enters sleep state, which causes application B
> > > > > to
> > > > > hang.
> > > > > So if we add drm_sched_wakeup() in
> > > > > drm_sched_entity_kill_jobs_work() after
> > > > > drm_sched_fence_scheduled(), the scheduler can be woken up,
> > > > > and
> > > > > application B can continue running after a1's scheduled fence
> > > > > is
> > > > > force signaled.
> > > >
> > > > Thanks for the detailed explanation. Makes sense.
> > > > Maybe you can detail in v3 that this "optimization" consists of
> > > > the
> > > > work item not being scheduled.
> > >
> > > Yeah, removing this "optimization" would also be a valid solution
> > > to
> > > the problem.
> >
> > If you at AMD are willing to work on that, that'd definitely be
> > fine.
> > Removing code is usually better than adding code.
>
> I fear I won't have time for that before my retirement :)
You've got fresh, powerful folks like Lin! :)
But either solution is fine. If you just want it fixed, we can merge
the existing approach.
>
> > Do you think being afraid of a performance regression is realistic
> > here, though?
>
> No, absolutely not. It was just a micro optimization done long ago.
>
> On the other hand with the scheduler code base I'm not sure of
> anything any more.
"In my subsystem you have to run as fast as you can just to remain at
the same place", said the Red Queen to Alice.
:)
P.
>
> Regards,
> Christian.
>
> >
> >
> > P.
> >
> > >
> > > Christian.
> > >
> > > > I think that was the piece of the puzzle
> > > > I was missing.
> > > >
> > > > I / DRM tools will also include a link to this thread, so I
> > > > think
> > > > that
> > > > will then be sufficient.
> > > >
> > > > Thx
> > > > P.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Lin
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > From: Philipp Stanner <phasta@mailbox.org>
> > > > > Sent: Tuesday, July 15, 2025 17:12
> > > > > To: cao, lin <lin.cao@amd.com>; Koenig, Christian
> > > > > <Christian.Koenig@amd.com>;
> > > > > phasta@kernel.org <phasta@kernel.org>;
> > > > > dri-devel@lists.freedesktop.org <
> > > > > dri-devel@lists.freedesktop.org>
> > > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
> > > > > <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
> > > > > matthew.brost@intel.com <matthew.brost@intel.com>
> > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > killing app with dependent jobs
> > > > >
> > > > >
> > > > > On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> > > > > >
> > > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > > >
> > > > > >
> > > > > >
> > > > > > Hi Christian, Philipp,
> > > > > >
> > > > > >
> > > > > > I modified the commit msg, and I hope it makes more sense:
> > > > > >
> > > > > >
> > > > > > drm/sched: wake up scheduler when killing jobs to prevent
> > > > > > hang
> > > > >
> > > > > nit:
> > > > > s/wake/Wake
> > > > >
> > > > > >
> > > > > >
> > > > > > When application A submits jobs (a1, a2, a3) and
> > > > > > application B
> > > > > > submits
> > > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > > application A
> > > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work()
> > > > > > to
> > > > > > force
> > > > > > signal all jobs sequentially. However, the optimization in
> > > > > > drm_sched_entity_add_dependency_cb() waits for the fence to
> > > > > > be
> > > > > > scheduled
> > > > > > by adding a callback (drm_sched_entity_clear_dep) instead
> > > > > > of
> > > > > > immediately
> > > > > > waking up the scheduler. When A is killed before its jobs
> > > > > > can
> > > > > > run, the
> > > > > > callback gets triggered but drm_sched_entity_clear_dep()
> > > > > > only
> > > > > > clears the
> > > > > > dependency without waking up the scheduler, causing the
> > > > > > scheduler to enter
> > > > > > sleep state and application B to hang.
> > > > >
> > > > > That now reads as if drm_sched_entity_clear_dep() is supposed
> > > > > to
> > > > > wake
> > > > > up the scheduler. But you add the wakeup to a different
> > > > > function
> > > > > (the
> > > > > work item).
> > > > >
> > > > > Also the code actually looks like that:
> > > > >
> > > > >
> > > > > xa_erase(&job->dependencies, index);
> > > > > if (f && !dma_fence_add_callback(f, &job->finish_cb,
> > > > > drm_sched_entity_kill_jobs_cb))
> > > > > return;
> > > > >
> > > > > dma_fence_put(f);
> > > > > }
> > > > >
> > > > > INIT_WORK(&job->work,
> > > > > drm_sched_entity_kill_jobs_work);
> > > > > schedule_work(&job->work);
> > > > > }
> > > > >
> > > > > So if adding that callback succeeds we.. return immediately
> > > > > but
> > > > > we.. do
> > > > > that for the first dependency, not for all of them?
> > > > >
> > > > > Oh boy. That code base. We(tm) should look into pimping that
> > > > > up.
> > > > > Also
> > > > > lacks documentation everywhere.
> > > > >
> > > > >
> > > > > Anyways. If this solves a bug for you the patch looks fine
> > > > > (i.e.,
> > > > > not
> > > > > potentially harmful) by me and if the tests succeed we can
> > > > > merge
> > > > > it.
> > > > > However, I'd feel better if you could clarify more why that
> > > > > function is
> > > > > the right place to solve the bug.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Add drm_sched_wakeup() in entity_kill_job_work() to prevent
> > > > > > scheduler
> > > > > > sleep and subsequent application hangs.
> > > > > >
> > > > > >
> > > > > > v2:
> > > > > > - Move drm_sched_wakeup() to after
> > > > > > drm_sched_fence_scheduled()
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Lin
> > > > > >
> > > > > >
> > > > > > From: Koenig, Christian <Christian.Koenig@amd.com>
> > > > > > Sent: Monday, July 14, 2025 21:39
> > > > > > To: phasta@kernel.org <phasta@kernel.org>; cao, lin
> > > > > > <lin.cao@amd.com>;
> > > > > > dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
> > > > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
> > > > > > <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
> > > > > > matthew.brost@intel.com <matthew.brost@intel.com>
> > > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > > killing app with dependent jobs
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 14.07.25 15:27, Philipp Stanner wrote:
> > > > > > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > > > > > regarding the patch subject: the prefix we use for
> > > > > > > > > the
> > > > > > > > > scheduler
> > > > > > > > > is:
> > > > > > > > > drm/sched:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > > > > > >
> > > > > > > > > > When Application A submits jobs (a1, a2, a3) and
> > > > > > > > > > application B
> > > > > > > > > > submits
> > > > > > > > >
> > > > > > > > > s/Application/application
> > > > > > > > >
> > > > > > > > > > job b1 with a dependency on a2's scheduler fence,
> > > > > > > > > > killing
> > > > > > > > > > application A
> > > > > > > > > > before run_job(a1) causes
> > > > > > > > > > drm_sched_entity_kill_jobs_work() to
> > > > > > > > > > force
> > > > > > > > > > signal all jobs sequentially. However, due to
> > > > > > > > > > missing
> > > > > > > > > > work_run_job or
> > > > > > > > > > work_free_job
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > You probably mean "due to the work items work_run_job
> > > > > > > > > and
> > > > > > > > > work_free_job
> > > > > > > > > not being started […]".
> > > > > > > > >
> > > > > > > > > However, the call you add, drm_sched_wakeup(),
> > > > > > > > > immediately only
> > > > > > > > > triggers work_run_job. It's not clear to me why you
> > > > > > > > > mention
> > > > > > > > > work_free_job, because
> > > > > > > > > drm_sched_entity_kill_jobs_work()
> > > > > > > > > directly
> > > > > > > > > invokes the free_job() callback.
> > > > > > > >
> > > > > > > > Yeah the description is rather confusing, took me more
> > > > > > > > than
> > > > > > > > one try
> > > > > > > > to understand what's going on here as well. Let me try
> > > > > > > > to
> > > > > > > > explain it
> > > > > > > > in my words:
> > > > > > > >
> > > > > > > > When an application A is killed the pending submissions
> > > > > > > > from it are
> > > > > > > > not executed, but rather torn down by
> > > > > > > > drm_sched_entity_kill_jobs_work().
> > > > > > > >
> > > > > > > > If now a submission from application B depends on
> > > > > > > > something
> > > > > > > > application A wanted to do on the same scheduler
> > > > > > > > instance
> > > > > > > > the
> > > > > > > > function drm_sched_entity_add_dependency_cb() would
> > > > > > > > have
> > > > > > > > optimized
> > > > > > > > this dependency and assumed that the scheduler work
> > > > > > > > would
> > > > > > > > already run
> > > > > > > > and try to pick a job.
> > > > > > > >
> > > > > > > > But that isn't the case when the submissions from
> > > > > > > > application A are
> > > > > > > > all killed. So make sure to start the scheduler work
> > > > > > > > from
> > > > > > > > drm_sched_entity_kill_jobs_work() to handle that case.
> > > > > > >
> > > > > > > Alright, so the bug then depends on B setting a
> > > > > > > dependency on
> > > > > > > A _after_
> > > > > > > A was killed, doesn't it? Because the optimization in
> > > > > > > _add_dependency_cb() can only have an effect after A's
> > > > > > > jobs
> > > > > > > have been
> > > > > > > torn down.
> > > > > >
> > > > > > No, application A and application B submit right behind
> > > > > > each
> > > > > > other. Application A is then killed later on.
> > > > > >
> > > > > > The optimization in _add_dependency_cb() just waits for A's
> > > > > > submission to be handled by the scheduler, but that never
> > > > > > happens because it was killed.
> > > > > >
> > > > > > > If there is such a timing order problem, the commit
> > > > > > > message
> > > > > > > should
> > > > > > > mention it.
> > > > > > >
> > > > > > > The commit message definitely also needs to mention this
> > > > > > > optimization
> > > > > > > in drm_sched_entity_add_dependency_cb() since it's
> > > > > > > essential
> > > > > > > for
> > > > > > > understanding the bug.
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Danke
> > > > > > > P.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > in entity_kill_job_work(), the scheduler enters
> > > > > > > > > > sleep
> > > > > > > > > > state, causing application B hang.
> > > > > > > > >
> > > > > > > > > So the issue is that if a1 never ran, the scheduler
> > > > > > > > > will
> > > > > > > > > not
> > > > > > > > > continue
> > > > > > > > > with the jobs of application B, correct?
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Add drm_sched_wakeup() when entity_kill_job_work()
> > > > > > > > > > to
> > > > > > > > > > preventing
> > > > > > > > >
> > > > > > > > > s/to preventing/to prevent
> > > > > > > > >
> > > > > > > > > > scheduler sleep and subsequent application hangs.
> > > > > > > > > >
> > > > > > > > > > v2:
> > > > > > > > > > - Move drm_sched_wakeup() to after
> > > > > > > > > > drm_sched_fence_scheduled()
> > > > > > > > >
> > > > > > > > > Changelogs are cool and useful, but move them below
> > > > > > > > > the
> > > > > > > > > Signed-off
> > > > > > > > > area
> > > > > > > > > with another --- please, like so:
> > > > > > > > >
> > > > > > > > > Signed-off by: …
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > …
> > > > > > > > > ---
> > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> > > > > > > > >
> > > > > > > > > This fixes a bug. Even a racy bug, it seems. So we
> > > > > > > > > need a
> > > > > > > > > Fixes tag
> > > > > > > > > and
> > > > > > > > > put the stable kernel in CC.
> > > > > > > > >
> > > > > > > > > Please figure out with git blame, git log and git tag
> > > > > > > > > --
> > > > > > > > > contains
> > > > > > > > > which
> > > > > > > > > commit your patch fixes and which is the oldest
> > > > > > > > > kernel
> > > > > > > > > that
> > > > > > > > > contains
> > > > > > > > > the bug. Then add a Fixes: tag and Cc: the stable
> > > > > > > > > kernel.
> > > > > > > > > You'll
> > > > > > > > > find
> > > > > > > > > lots of examples for that in git log.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thx
> > > > > > > > > P.
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > > > 1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git
> > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > @@ -177,6 +177,7 @@ static void
> > > > > > > > > > drm_sched_entity_kill_jobs_work(struct work_struct
> > > > > > > > > > *wrk)
> > > > > > > > > > struct drm_sched_job *job = container_of(wrk,
> > > > > > > > > > typeof(*job), work);
> > > > > > > > > >
> > > > > > > > > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > > > > > + drm_sched_wakeup(job->sched);
> > > > > > > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > > > > > > WARN_ON(job->s_fence->parent);
> > > > > > > > > > job->sched->ops->free_job(job);
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
2025-07-15 13:07 ` Philipp Stanner
@ 2025-07-15 13:20 ` cao, lin
0 siblings, 0 replies; 15+ messages in thread
From: cao, lin @ 2025-07-15 13:20 UTC (permalink / raw)
To: Koenig, Christian, phasta@kernel.org,
dri-devel@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris), Deng, Emily, dakr@kernel.org,
matthew.brost@intel.com
[-- Attachment #1: Type: text/plain, Size: 17460 bytes --]
[AMD Official Use Only - AMD Internal Distribution Only]
Hi Philipp, Christian,
Thanks for the kind words! :)
I've reconsidered the approach and decided to go with the "removing code" solution instead. I'll drop the v3 patch and submit a new v4 that removes the problematic optimization in drm_sched_entity_add_dependency_cb().
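
For readers following the thread: with the same-scheduler branch removed,
drm_sched_entity_add_dependency_cb() would reduce to roughly the
following. This is a sketch of the direction such a v4 could take, not
the actual v4 patch:

static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
{
        struct dma_fence *fence = entity->dependency;

        /* Scheduled/finished fences from our own entity can be ignored. */
        if (fence->context == entity->fence_context ||
            fence->context == entity->fence_context + 1) {
                dma_fence_put(entity->dependency);
                return false;
        }

        /* Always wake the scheduler up when the dependency signals. */
        if (!dma_fence_add_callback(entity->dependency, &entity->cb,
                                    drm_sched_entity_wakeup))
                return true;

        /* Fence was already signaled: no callback installed, carry on. */
        dma_fence_put(entity->dependency);
        return false;
}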
Thanks,
Lin
________________________________
From: Philipp Stanner <phasta@mailbox.org>
Sent: Tuesday, July 15, 2025 21:07
To: Koenig, Christian <Christian.Koenig@amd.com>; phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com>
Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
On Tue, 2025-07-15 at 14:32 +0200, Christian König wrote:
> On 15.07.25 14:20, Philipp Stanner wrote:
> > On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
> > > On 15.07.25 12:27, Philipp Stanner wrote:
> > > > On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
> > > > >
> > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > >
> > > > >
> > > > >
> > > > > Hi Philipp,
> > > > >
> > > > >
> > > > > Thanks for your review, let me try to clarify why I added
> > > > > drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
> > > > >
> > > > >
> > > > > When application A submits jobs (a1, a2, a3) and application
> > > > > B
> > > > > submits job b1 with a dependency on a1's scheduled fence, the
> > > > > normal execution flow is (function drm_sched_run_job_work()):
> > > > > 1. a1 gets popped from the entity by the scheduler
> > > > > 2. run_job(a1) executes
> > > > > 3. a1's scheduled fence gets signaled
> > > > > 4. drm_sched_run_job_work() calls drm_sched_run_job_queue()
> > > > > at
> > > > > the end
> > > > > 5. The scheduler wakes up and re-selects entities to pop jobs
> > > > > 6. Since b1's dependency is cleared, the scheduler can select
> > > > > b1
> > > > > and continue the same flow
> > > > >
> > > > >
> > > > > However, if application A is killed before a1 gets popped by
> > > > > the
> > > > > scheduler, then a1, a2, a3 are killed sequentially by
> > > > > drm_sched_entity_kill_jobs_cb(). During the kill process,
> > > > > their
> > > > > scheduled fences are still signaled, but the kill process
> > > > > itself
> > > > > lacks work_run_job. This means b1's dependency gets cleared,
> > > > > but
> > > > > there's no work_run_job to drive the scheduler to continue
> > > > > running, or we can say step 4 in the normal flow is missing,
> > > > > and
> > > > > the scheduler enters sleep state, which causes application B
> > > > > to
> > > > > hang.
> > > > > So if we add drm_sched_wakeup() in
> > > > > drm_sched_entity_kill_jobs_work() after
> > > > > drm_sched_fence_scheduled(), the scheduler can be woken up,
> > > > > and
> > > > > application B can continue running after a1's scheduled fence
> > > > > is
> > > > > force signaled.
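
For readers without the tree at hand: in the workqueue-based
scheduler, drm_sched_wakeup() essentially just re-queues the run-job
work item, i.e. exactly the "step 4" that the kill path was missing.
A sketch, assuming the sched_main.c helpers of the time:

	static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
	{
		/* Queue work_run_job unless submission is paused. */
		if (!READ_ONCE(sched->pause_submit))
			queue_work(sched->submit_wq, &sched->work_run_job);
	}

	void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
	{
		drm_sched_run_job_queue(sched);
	}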
> > > >
> > > > Thanks for the detailed explanation. Makes sense.
> > > > Maybe you can detail in v3 that this "optimization" consists of
> > > > the
> > > > work item not being scheduled.
> > >
> > > Yeah, removing this "optimization" would also be a valid solution
> > > to
> > > the problem.
> >
> > If you at AMD are willing to work on that that'd be definitely
> > fine.
> > Removing code is usually better than adding code.
>
> I fear I won't have time for that before my retirement :)
You've got fresh, powerful folks like Lin! :)
But either solution is fine. If you just want it fixed, we can merge
the existing approach.
>
> > Do you think being afraid of a performance regression is realistic
> > here, though?
>
> No, absolutely not. It was just a micro optimization done long ago.
>
> On the other hand with the scheduler code base I'm not sure of
> anything any more.
"In my subsystem you have to run as fast as you can just to remain at
the same place", said the Red Queen to Alice.
:)
P.
>
> Regards,
> Christian.
>
> >
> >
> > P.
> >
> > >
> > > Christian.
> > >
> > > > I think that was the piece of the puzzle
> > > > I was missing.
> > > >
> > > > I / DRM tools will also include a link to this thread, so I
> > > > think
> > > > that
> > > > will then be sufficient.
> > > >
> > > > Thx
> > > > P.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Lin
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > From: Philipp Stanner <phasta@mailbox.org>
> > > > > Sent: Tuesday, July 15, 2025 17:12
> > > > > To: cao, lin <lin.cao@amd.com>; Koenig, Christian
> > > > > <Christian.Koenig@amd.com>;
> > > > > phasta@kernel.org <phasta@kernel.org>;
> > > > > dri-devel@lists.freedesktop.org <
> > > > > dri-devel@lists.freedesktop.org>
> > > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
> > > > > <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
> > > > > matthew.brost@intel.com <matthew.brost@intel.com>
> > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > killing app with dependent jobs
> > > > >
> > > > >
> > > > > On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> > > > > >
> > > > > > Hi Christian, Philipp,
> > > > > >
> > > > > > I modified the commit msg, and I hope it makes more sense:
> > > > > >
> > > > > > drm/sched: wake up scheduler when killing jobs to prevent hang
> > > > >
> > > > > nit:
> > > > > s/wake/Wake
> > > > >
> > > > > >
> > > > > > When application A submits jobs (a1, a2, a3) and application B
> > > > > > submits job b1 with a dependency on a2's scheduler fence, killing
> > > > > > application A before run_job(a1) causes
> > > > > > drm_sched_entity_kill_jobs_work() to force signal all jobs
> > > > > > sequentially. However, the optimization in
> > > > > > drm_sched_entity_add_dependency_cb() waits for the fence to be
> > > > > > scheduled by adding a callback (drm_sched_entity_clear_dep)
> > > > > > instead of immediately waking up the scheduler. When A is killed
> > > > > > before its jobs can run, the callback gets triggered, but
> > > > > > drm_sched_entity_clear_dep() only clears the dependency without
> > > > > > waking up the scheduler, causing the scheduler to enter sleep
> > > > > > state and application B to hang.
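
For context, the callback named above is tiny. A sketch of
drm_sched_entity_clear_dep() as it stood at the time, assuming the
upstream sched_entity.c; it clears the dependency but never wakes the
scheduler, which is the crux of the bug:

	static void drm_sched_entity_clear_dep(struct dma_fence *f,
					       struct dma_fence_cb *cb)
	{
		struct drm_sched_entity *entity =
			container_of(cb, struct drm_sched_entity, cb);

		/* Clear the dependency, but do not wake the scheduler. */
		entity->dependency = NULL;
		dma_fence_put(f);
	}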
> > > > >
> > > > > That now reads as if drm_sched_entity_clear_dep() is supposed to
> > > > > wake up the scheduler. But you add the wakeup to a different
> > > > > function (the work item).
> > > > >
> > > > > Also the code actually looks like that:
> > > > >
> > > > >                 xa_erase(&job->dependencies, index);
> > > > >                 if (f && !dma_fence_add_callback(f, &job->finish_cb,
> > > > >                                                  drm_sched_entity_kill_jobs_cb))
> > > > >                         return;
> > > > >
> > > > >                 dma_fence_put(f);
> > > > >         }
> > > > >
> > > > >         INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
> > > > >         schedule_work(&job->work);
> > > > > }
> > > > >
> > > > > So if adding that callback succeeds we... return immediately, but
> > > > > we... do that for the first dependency, not for all of them?
> > > > >
> > > > > Oh boy. That code base. We(tm) should look into pimping that up.
> > > > > Also lacks documentation everywhere.
> > > > >
> > > > >
> > > > > Anyways. If this solves a bug for you the patch looks fine (i.e.,
> > > > > not potentially harmful) by me, and if the tests succeed we can
> > > > > merge it. However, I'd feel better if you could clarify more why
> > > > > that function is the right place to solve the bug.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Add drm_sched_wakeup() in entity_kill_job_work() to prevent
> > > > > > scheduler sleep and subsequent application hangs.
> > > > > >
> > > > > >
> > > > > > v2:
> > > > > > - Move drm_sched_wakeup() to after
> > > > > > drm_sched_fence_scheduled()
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Lin
> > > > > >
> > > > > >
> > > > > > From: Koenig, Christian <Christian.Koenig@amd.com>
> > > > > > Sent: Monday, July 14, 2025 21:39
> > > > > > To: phasta@kernel.org <phasta@kernel.org>; cao, lin
> > > > > > <lin.cao@amd.com>;
> > > > > > dri-devel@lists.freedesktop.org <
> > > > > > dri-devel@lists.freedesktop.org
> > > > > > >
> > > > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily
> > > > > > <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>;
> > > > > > matthew.brost@intel.com <matthew.brost@intel.com>
> > > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > > killing app with dependent jobs
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 14.07.25 15:27, Philipp Stanner wrote:
> > > > > > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > > > > > regarding the patch subject: the prefix we use for the
> > > > > > > > > scheduler is:
> > > > > > > > > drm/sched:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > > > > > >
> > > > > > > > > > When Application A submits jobs (a1, a2, a3) and
> > > > > > > > > > application B submits
> > > > > > > > >
> > > > > > > > > s/Application/application
> > > > > > > > >
> > > > > > > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > > > > > > application A before run_job(a1) causes
> > > > > > > > > > drm_sched_entity_kill_jobs_work() to force signal all
> > > > > > > > > > jobs sequentially. However, due to missing work_run_job
> > > > > > > > > > or work_free_job
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > You probably mean "due to the work items work_run_job and
> > > > > > > > > work_free_job not being started […]".
> > > > > > > > >
> > > > > > > > > However, the call you add, drm_sched_wakeup(), immediately
> > > > > > > > > only triggers work_run_job. It's not clear to me why you
> > > > > > > > > mention work_free_job, because
> > > > > > > > > drm_sched_entity_kill_jobs_work() directly invokes the
> > > > > > > > > free_job() callback.
> > > > > > > >
> > > > > > > > Yeah, the description is rather confusing; it took me more
> > > > > > > > than one try to understand what's going on here as well. Let
> > > > > > > > me try to explain it in my words:
> > > > > > > >
> > > > > > > > When an application A is killed, the pending submissions
> > > > > > > > from it are not executed, but rather torn down by
> > > > > > > > drm_sched_entity_kill_jobs_work().
> > > > > > > >
> > > > > > > > If now a submission from application B depends on something
> > > > > > > > application A wanted to do on the same scheduler instance,
> > > > > > > > the function drm_sched_entity_add_dependency_cb() would have
> > > > > > > > optimized this dependency and assumed that the scheduler
> > > > > > > > work would already run and try to pick a job.
> > > > > > > >
> > > > > > > > But that isn't the case when the submissions from
> > > > > > > > application A are all killed. So make sure to start the
> > > > > > > > scheduler work from drm_sched_entity_kill_jobs_work() to
> > > > > > > > handle that case.
> > > > > > >
> > > > > > > Alright, so the bug then depends on B setting a dependency on
> > > > > > > A _after_ A was killed, doesn't it? Because the optimization
> > > > > > > in _add_dependency_cb() can only have an effect after A's jobs
> > > > > > > have been torn down.
> > > > > >
> > > > > > No, application A and application B submit right behind each
> > > > > > other. Application A is then killed later on.
> > > > > >
> > > > > > The optimization in _add_dependency_cb() just waits for A's
> > > > > > submission to be handled by the scheduler, but that never
> > > > > > happens because it was killed.
> > > > > >
> > > > > > > If there is such a timing order problem, the commit message
> > > > > > > should mention it.
> > > > > > >
> > > > > > > The commit message definitely also needs to mention this
> > > > > > > optimization in drm_sched_entity_add_dependency_cb() since
> > > > > > > it's essential for understanding the bug.
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Danke
> > > > > > > P.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > in entity_kill_job_work(), the scheduler enters sleep
> > > > > > > > > > state, causing application B hang.
> > > > > > > > >
> > > > > > > > > So the issue is that if a1 never ran, the scheduler will
> > > > > > > > > not continue with the jobs of application B, correct?
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Add drm_sched_wakeup() when entity_kill_job_work() to
> > > > > > > > > > preventing
> > > > > > > > >
> > > > > > > > > s/to preventing/to prevent
> > > > > > > > >
> > > > > > > > > > scheduler sleep and subsequent application hangs.
> > > > > > > > > >
> > > > > > > > > > v2:
> > > > > > > > > > - Move drm_sched_wakeup() to after
> > > > > > > > > > drm_sched_fence_scheduled()
> > > > > > > > >
> > > > > > > > > Changelogs are cool and useful, but move them below the
> > > > > > > > > Signed-off area with another --- please, like so:
> > > > > > > > >
> > > > > > > > > Signed-off by: …
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > …
> > > > > > > > > ---
> > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Lin.Cao <lincao12@amd.com>
> > > > > > > > >
> > > > > > > > > This fixes a bug. Even a racy bug, it seems. So we need a
> > > > > > > > > Fixes tag and to put the stable kernel in CC.
> > > > > > > > >
> > > > > > > > > Please figure out with git blame, git log and
> > > > > > > > > git tag --contains which commit your patch fixes and which
> > > > > > > > > is the oldest kernel that contains the bug. Then add a
> > > > > > > > > Fixes: tag and Cc: the stable kernel. You'll find lots of
> > > > > > > > > examples for that in git log.
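
For illustration, the resulting trailer block usually looks like
this; the hash and subject below are placeholders, not the actual
offending commit:

	Fixes: 123456789abc ("drm/sched: placeholder subject of offending commit")
	Cc: <stable@vger.kernel.org>
	Signed-off-by: Lin.Cao <lincao12@amd.com>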
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thx
> > > > > > > > > P.
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > > > 1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git
> > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > @@ -177,6 +177,7 @@ static void
> > > > > > > > > > drm_sched_entity_kill_jobs_work(struct work_struct
> > > > > > > > > > *wrk)
> > > > > > > > > > struct drm_sched_job *job = container_of(wrk,
> > > > > > > > > > typeof(*job), work);
> > > > > > > > > >
> > > > > > > > > > drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > > > > > + drm_sched_wakeup(job->sched);
> > > > > > > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > > > > > > WARN_ON(job->s_fence->parent);
> > > > > > > > > > job->sched->ops->free_job(job);
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-07-15 13:20 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-14 6:23 [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs Lin.Cao
2025-07-14 11:03 ` Christian König
2025-07-14 12:46 ` Philipp Stanner
2025-07-14 13:08 ` Christian König
2025-07-14 13:27 ` Philipp Stanner
2025-07-14 13:39 ` Christian König
2025-07-15 3:38 ` cao, lin
2025-07-15 9:12 ` Philipp Stanner
2025-07-15 9:51 ` cao, lin
2025-07-15 10:27 ` Philipp Stanner
2025-07-15 10:52 ` Christian König
2025-07-15 12:20 ` Philipp Stanner
2025-07-15 12:32 ` Christian König
2025-07-15 13:07 ` Philipp Stanner
2025-07-15 13:20 ` cao, lin
This is a public inbox; see mirroring instructions for how to clone
and mirror all data and code used for this inbox, as well as URLs for
NNTP newsgroup(s).