* [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs_cb
From: Pierre-Eric Pelloux-Prayer @ 2025-10-29 9:11 UTC
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Sumit Semwal
Cc: Pierre-Eric Pelloux-Prayer, Christian König, dri-devel,
linux-kernel, linux-media, linaro-mm-sig
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:

[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                               local_irq_disable();
[ 1231.611039]                               lock(&fence->lock);
[ 1231.611041]                               lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]
                *** DEADLOCK ***
My initial fix was to replace xa_erase by xa_erase_irq, but Christian
pointed out that calling dma_fence_add_callback from a callback can
also deadlock if the signalling fence and the one passed to
dma_fence_add_callback share the same lock.
To fix both issues, the code iterating on dependencies and re-arming them
is moved out to drm_sched_entity_kill_jobs_work.
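
To illustrate the second problem: dma_fence callbacks run with the
signalling fence's lock held, and dma_fence_init() allows several fences
to share a single spinlock. A minimal sketch of the hazard (hypothetical
fences a and b, not actual scheduler code):

	static DEFINE_SPINLOCK(shared_lock);
	static struct dma_fence a, b; /* both dma_fence_init()ed on &shared_lock */

	static void rearm_cb(struct dma_fence *f, struct dma_fence_cb *cb)
	{
		/*
		 * Runs from dma_fence_signal(&a) with shared_lock already
		 * held by this CPU; dma_fence_add_callback() then takes
		 * b's lock == shared_lock -> self-deadlock.
		 */
		dma_fence_add_callback(&b, cb, rearm_cb);
	}

Deferring the dependency iteration to a work item guarantees that
dma_fence_add_callback() is never called with a fence lock held.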
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_scheduled(job->s_fence, NULL);
-	drm_sched_fence_finished(job->s_fence, -ESRCH);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
-}
-
-/* Signal the scheduler finished fence when the entity in question is killed. */
-static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
-					  struct dma_fence_cb *cb)
-{
-	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
-						 finish_cb);
+	struct dma_fence *f;
 	unsigned long index;
 
-	dma_fence_put(f);
-
 	/* Wait for all dependencies to avoid data corruptions */
 	xa_for_each(&job->dependencies, index, f) {
 		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
@@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 		dma_fence_put(f);
 	}
 
+	drm_sched_fence_scheduled(job->s_fence, NULL);
+	drm_sched_fence_finished(job->s_fence, -ESRCH);
+	WARN_ON(job->s_fence->parent);
+	job->sched->ops->free_job(job);
+}
+
+/* Signal the scheduler finished fence when the entity in question is killed. */
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb)
+{
+	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+						 finish_cb);
+
+	dma_fence_put(f);
+
 	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
 	schedule_work(&job->work);
 }
--
2.43.0
* Re: [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs_cb
From: Christian König @ 2025-10-29 13:05 UTC
To: Pierre-Eric Pelloux-Prayer, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Sumit Semwal
Cc: Christian König, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
On 10/29/25 10:11, Pierre-Eric Pelloux-Prayer wrote:
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
> a possible deadlock:
>
> [ 1231.611031] Possible interrupt unsafe locking scenario:
>
> [ 1231.611033]        CPU0                    CPU1
> [ 1231.611034]        ----                    ----
> [ 1231.611035]   lock(&xa->xa_lock#17);
> [ 1231.611038]                               local_irq_disable();
> [ 1231.611039]                               lock(&fence->lock);
> [ 1231.611041]                               lock(&xa->xa_lock#17);
> [ 1231.611044]   <Interrupt>
> [ 1231.611045]     lock(&fence->lock);
> [ 1231.611047]
>                 *** DEADLOCK ***
>
> My initial fix was to replace xa_erase by xa_erase_irq, but Christian
> pointed out that calling dma_fence_add_callback from a callback can
> also deadlock if the signalling fence and the one passed to
> dma_fence_add_callback share the same lock.
>
> To fix both issues, the code iterating on dependencies and re-arming them
> is moved out to drm_sched_entity_kill_jobs_work.
>
> Suggested-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
> 1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index c8e949f4a568..fe174a4857be 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
>  }
>  EXPORT_SYMBOL(drm_sched_entity_error);
>  
> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> +					  struct dma_fence_cb *cb);
> +
>  static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>  {
>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
> -
> -	drm_sched_fence_scheduled(job->s_fence, NULL);
> -	drm_sched_fence_finished(job->s_fence, -ESRCH);
> -	WARN_ON(job->s_fence->parent);
> -	job->sched->ops->free_job(job);
> -}
> -
> -/* Signal the scheduler finished fence when the entity in question is killed. */
> -static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> -					  struct dma_fence_cb *cb)
> -{
> -	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> -						 finish_cb);
> +	struct dma_fence *f;
>  	unsigned long index;
>  
> -	dma_fence_put(f);
> -
>  	/* Wait for all dependencies to avoid data corruptions */
>  	xa_for_each(&job->dependencies, index, f) {
>  		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> @@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>  		dma_fence_put(f);
>  	}
>  
> +	drm_sched_fence_scheduled(job->s_fence, NULL);
> +	drm_sched_fence_finished(job->s_fence, -ESRCH);
> +	WARN_ON(job->s_fence->parent);
> +	job->sched->ops->free_job(job);
> +}
> +
> +/* Signal the scheduler finished fence when the entity in question is killed. */
> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> +					  struct dma_fence_cb *cb)
> +{
> +	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> +						 finish_cb);
> +
> +	dma_fence_put(f);
> +
>  	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>  	schedule_work(&job->work);
>  }
* Re: [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs_cb
From: Philipp Stanner @ 2025-10-30 11:17 UTC
To: Pierre-Eric Pelloux-Prayer, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Sumit Semwal
Cc: Christian König, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes:
tag. Optionally add a Reported-by:, too.
> a possible deadlock:
>
> [ 1231.611031] Possible interrupt unsafe locking scenario:
>
> [ 1231.611033]        CPU0                    CPU1
> [ 1231.611034]        ----                    ----
> [ 1231.611035]   lock(&xa->xa_lock#17);
> [ 1231.611038]                               local_irq_disable();
> [ 1231.611039]                               lock(&fence->lock);
> [ 1231.611041]                               lock(&xa->xa_lock#17);
> [ 1231.611044]   <Interrupt>
> [ 1231.611045]     lock(&fence->lock);
> [ 1231.611047]
>                 *** DEADLOCK ***
>
The commit message is lacking an explanation as to _how_ and _when_ the
deadlock comes to be. That's a prerequisite for understanding why the
below is the proper fix and solution.
The issue seems to be that you cannot perform certain tasks from within
that work item?
> My initial fix was to replace xa_erase by xa_erase_irq, but Christian
> pointed out that calling dma_fence_add_callback from a callback can
> also deadlock if the signalling fence and the one passed to
> dma_fence_add_callback share the same lock.
>
> To fix both issues, the code iterating on dependencies and re-arming them
> is moved out to drm_sched_entity_kill_jobs_work.
>
> Suggested-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
> 1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index c8e949f4a568..fe174a4857be 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
>  }
>  EXPORT_SYMBOL(drm_sched_entity_error);
>  
> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> +					  struct dma_fence_cb *cb);
> +
>  static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>  {
>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
> -
> -	drm_sched_fence_scheduled(job->s_fence, NULL);
> -	drm_sched_fence_finished(job->s_fence, -ESRCH);
> -	WARN_ON(job->s_fence->parent);
> -	job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
P.
* Re: [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs_cb
From: Pierre-Eric Pelloux-Prayer @ 2025-10-30 12:06 UTC
To: phasta, Pierre-Eric Pelloux-Prayer, Matthew Brost,
Danilo Krummrich, Christian König, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Sumit Semwal
Cc: Christian König, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
On 2025-10-30 12:17, Philipp Stanner wrote:
> On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
>> https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
>
> This link should be moved to the tag section at the bottom as a Closes:
> tag. Optionally add a Reported-by:, too.
The bug report is about a different issue. The potential deadlock being fixed by
this patch was discovered while investigating it.
I'll add a Reported-by tag though.
>
>> a possible deadlock:
>>
>> [ 1231.611031] Possible interrupt unsafe locking scenario:
>>
>> [ 1231.611033]        CPU0                    CPU1
>> [ 1231.611034]        ----                    ----
>> [ 1231.611035]   lock(&xa->xa_lock#17);
>> [ 1231.611038]                               local_irq_disable();
>> [ 1231.611039]                               lock(&fence->lock);
>> [ 1231.611041]                               lock(&xa->xa_lock#17);
>> [ 1231.611044]   <Interrupt>
>> [ 1231.611045]     lock(&fence->lock);
>> [ 1231.611047]
>>                 *** DEADLOCK ***
>>
>
> The commit message is lacking an explanation as to _how_ and _when_ the
> deadlock comes to be. That's a prerequisite for understanding why the
> below is the proper fix and solution.
I copy-pasted a small chunk of the full deadlock analysis report included in the
ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but
I can add more context.
The problem is that a thread (CPU0 above) can take the job's dependencies
xarray lock without disabling interrupts.
If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb
is called from the interrupt handler, the callback will try to grab the same
xa_lock, which is impossible because CPU0 already holds it.
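
Sketched with the relevant calls (simplified; the xa_*() calls are the
real ones, the surrounding context is schematic):

	/* Process context, interrupts enabled, e.g. from
	 * drm_sched_job_add_dependency():
	 */
	xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
	/* -> takes spin_lock(&xa->xa_lock) without disabling irqs */

	/* Fence signals, potentially from hardirq context, and the old
	 * drm_sched_entity_kill_jobs_cb() runs:
	 */
	xa_erase(&job->dependencies, index);
	/* -> spins on the same xa_lock the interrupted thread holds */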
>
> The issue seems to be that you cannot perform certain tasks from within
> that work item?
>
>> My initial fix was to replace xa_erase by xa_erase_irq, but Christian
>> pointed out that calling dma_fence_add_callback from a callback can
>> also deadlock if the signalling fence and the one passed to
>> dma_fence_add_callback share the same lock.
>>
>> To fix both issues, the code iterating on dependencies and re-arming them
>> is moved out to drm_sched_entity_kill_jobs_work.
>>
>> Suggested-by: Christian König <christian.koenig@amd.com>
>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
>> 1 file changed, 19 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index c8e949f4a568..fe174a4857be 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
>>  }
>>  EXPORT_SYMBOL(drm_sched_entity_error);
>>  
>> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>> +					  struct dma_fence_cb *cb);
>> +
>>  static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>  {
>>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>> -
>> -	drm_sched_fence_scheduled(job->s_fence, NULL);
>> -	drm_sched_fence_finished(job->s_fence, -ESRCH);
>> -	WARN_ON(job->s_fence->parent);
>> -	job->sched->ops->free_job(job);
>
> Can free_job() really not be called from within work item context?
It's still called from drm_sched_entity_kill_jobs_work but the diff is slightly
confusing.
Pierre-Eric
>
>
> P.
* Re: [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs_cb
From: Philipp Stanner @ 2025-10-30 12:26 UTC
To: Pierre-Eric Pelloux-Prayer, phasta, Pierre-Eric Pelloux-Prayer,
Matthew Brost, Danilo Krummrich, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Sumit Semwal
Cc: Christian König, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
On Thu, 2025-10-30 at 13:06 +0100, Pierre-Eric Pelloux-Prayer wrote:
>
>
> On 2025-10-30 12:17, Philipp Stanner wrote:
> > On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
> > > https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
> >
> > This link should be moved to the tag section at the bottom as a Closes:
> > tag. Optionally add a Reported-by:, too.
>
> The bug report is about a different issue. The potential deadlock being fixed by
> this patch was discovered while investigating it.
> I'll add a Reported-by tag though.
>
> >
> > > a possible deadlock:
> > >
> > > [ 1231.611031] Possible interrupt unsafe locking scenario:
> > >
> > > [ 1231.611033]        CPU0                    CPU1
> > > [ 1231.611034]        ----                    ----
> > > [ 1231.611035]   lock(&xa->xa_lock#17);
> > > [ 1231.611038]                               local_irq_disable();
> > > [ 1231.611039]                               lock(&fence->lock);
> > > [ 1231.611041]                               lock(&xa->xa_lock#17);
> > > [ 1231.611044]   <Interrupt>
> > > [ 1231.611045]     lock(&fence->lock);
> > > [ 1231.611047]
> > >                 *** DEADLOCK ***
> > >
> >
> > The commit message is lacking an explanation as to _how_ and _when_ the
> > deadlock comes to be. That's a prerequisite for understanding why the
> > below is the proper fix and solution.
>
> I copy-pasted a small chunk of the full deadlock analysis report included in the
> ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but
> I can add more context.
The log wouldn't be useful, but a human-written explanation like the one
you give below would be.
>
> The problem is that a thread (CPU0 above) can take the job's dependencies
> xarray lock without disabling interrupts.
Which drm_sched function would that be?
> If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb
> is called from the interrupt handler, the callback will try to grab the same
> xa_lock, which is impossible because CPU0 already holds it.
You mean an *interrupt* signals the fence? Shouldn't interrupt issues
be solved with spin_lock_irqsave() – but we can't have that here
because it's the xarray taking its own lock internally?
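
(For reference, roughly what the xarray wrappers do internally – plain
xa_erase() takes xa_lock without touching the irq state, and
xa_erase_irq() is the irq-disabling variant; sketched from
include/linux/xarray.h and lib/xarray.c:

	void *xa_erase(struct xarray *xa, unsigned long index)
	{
		void *entry;

		xa_lock(xa);			/* spin_lock(&xa->xa_lock) */
		entry = __xa_erase(xa, index);
		xa_unlock(xa);

		return entry;
	}

xa_erase_irq() is identical except that it uses xa_lock_irq() /
xa_unlock_irq().)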
You don't have to explain that in this mail-thread; a v2 detailing it
would be sufficient.
>
>
> >
> > The issue seems to be that you cannot perform certain tasks from within
> > that work item?
[…]
> >
> > > +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> > > +					  struct dma_fence_cb *cb);
> > > +
> > >  static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > >  {
> > >  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
> > > -
> > > -	drm_sched_fence_scheduled(job->s_fence, NULL);
> > > -	drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > -	WARN_ON(job->s_fence->parent);
> > > -	job->sched->ops->free_job(job);
> >
> > Can free_job() really not be called from within work item context?
>
> It's still called from drm_sched_entity_kill_jobs_work but the diff is slightly
> confusing.
OK, probably my bad. But just asking, do you use
git format-patch --histogram
?
histogram often produces better diffs than default.
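
For example, either per invocation or persistently (assuming a normal
kernel tree):

	git format-patch --histogram -1
	# or:
	git config diff.algorithm histogram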
P.