* [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
@ 2025-06-25 14:37 Pierre-Eric Pelloux-Prayer
2025-06-26 8:41 ` Tvrtko Ursulin
0 siblings, 1 reply; 7+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2025-06-25 14:37 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer
Cc: dri-devel, linux-kernel
The drm_sched_job_unschedulable trace point can access
entity->dependency after it was cleared by the callback
installed in drm_sched_entity_add_dependency_cb, causing:
BUG: kernel NULL pointer dereference, address: 0000000000000020
Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
To fix this we either need to take a reference of the fence before
setting up the callbacks, or move the trace_drm_sched_job_unschedulable
calls into drm_sched_entity_add_dependency_cb where they can be
done earlier. The former option is the more correct one because with
the latter we might emit the event and still be able to schedule the
job if the fence is signaled in-between. Despite that, I've
implemented the latter, since it's a bit simpler and the extra event
is not a deal breaker for tools anyway.
Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 5635b3a826d8..a23b204cac5c 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
* Add a callback to the current dependency of the entity to wake up the
* scheduler when the entity becomes available.
*/
-static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
+static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
+ struct drm_sched_job *sched_job)
{
struct drm_gpu_scheduler *sched = entity->rq->sched;
struct dma_fence *fence = entity->dependency;
@@ -429,6 +430,11 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
fence = dma_fence_get(&s_fence->scheduled);
dma_fence_put(entity->dependency);
entity->dependency = fence;
+
+ if (trace_drm_sched_job_unschedulable_enabled() &&
+ !dma_fence_is_signaled(fence))
+ trace_drm_sched_job_unschedulable(sched_job, fence);
+
if (!dma_fence_add_callback(fence, &entity->cb,
drm_sched_entity_clear_dep))
return true;
@@ -438,6 +444,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
return false;
}
+ if (trace_drm_sched_job_unschedulable_enabled() &&
+ !dma_fence_is_signaled(entity->dependency))
+ trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
+
if (!dma_fence_add_callback(entity->dependency, &entity->cb,
drm_sched_entity_wakeup))
return true;
@@ -478,10 +488,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
while ((entity->dependency =
drm_sched_job_dependency(sched_job, entity))) {
- if (drm_sched_entity_add_dependency_cb(entity)) {
- trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
+ if (drm_sched_entity_add_dependency_cb(entity, sched_job))
return NULL;
- }
}
/* skip jobs from entity that marked guilty */
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
2025-06-25 14:37 [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency Pierre-Eric Pelloux-Prayer
@ 2025-06-26 8:41 ` Tvrtko Ursulin
2025-06-26 13:43 ` Pierre-Eric Pelloux-Prayer
0 siblings, 1 reply; 7+ messages in thread
From: Tvrtko Ursulin @ 2025-06-26 8:41 UTC (permalink / raw)
To: Pierre-Eric Pelloux-Prayer, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: dri-devel, linux-kernel
On 25/06/2025 15:37, Pierre-Eric Pelloux-Prayer wrote:
> The drm_sched_job_unschedulable trace point can access
> entity->dependency after it was cleared by the callback
> installed in drm_sched_entity_add_dependency_cb, causing:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000020
> Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
> RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
>
> To fix this we either need to take a reference of the fence before
> setting up the callbacks, or move the trace_drm_sched_job_unschedulable
> calls into drm_sched_entity_add_dependency_cb where they can be
> done earlier. The former option is the more correct one because with
> the latter we might emit the event and still be able to schedule the
> job if the fence is signaled in-between. Despite that, I've
> implemented the latter, since it's a bit simpler and the extra event
> is not a deal breaker for tools anyway.
>
> Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
>
> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 5635b3a826d8..a23b204cac5c 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
> * Add a callback to the current dependency of the entity to wake up the
> * scheduler when the entity becomes available.
> */
> -static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> +static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
> + struct drm_sched_job *sched_job)
> {
> struct drm_gpu_scheduler *sched = entity->rq->sched;
> struct dma_fence *fence = entity->dependency;
> @@ -429,6 +430,11 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> fence = dma_fence_get(&s_fence->scheduled);
> dma_fence_put(entity->dependency);
> entity->dependency = fence;
> +
> + if (trace_drm_sched_job_unschedulable_enabled() &&
> + !dma_fence_is_signaled(fence))
> + trace_drm_sched_job_unschedulable(sched_job, fence);
> +
> if (!dma_fence_add_callback(fence, &entity->cb,
> drm_sched_entity_clear_dep))
Would placing it here be simpler, like this:
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -430,8 +430,10 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
dma_fence_put(entity->dependency);
entity->dependency = fence;
if (!dma_fence_add_callback(fence, &entity->cb,
- drm_sched_entity_clear_dep))
+ drm_sched_entity_clear_dep)) {
+ trace_drm_sched_job_unschedulable(sched_job, fence);
return true;
+ }
/* Ignore it when it is already scheduled */
dma_fence_put(fence);
Both sched_job and fence references are held on this path (effectively
for sched_job), so it is safe from that point of view. Makes the "is
signaled" check not needed because dma_fence_add_callback() told us so.
And trace_drm_sched_job_unschedulable_enabled() is then not needed too.
> return true;
> @@ -438,6 +444,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> return false;
> }
>
> + if (trace_drm_sched_job_unschedulable_enabled() &&
> + !dma_fence_is_signaled(entity->dependency))
> + trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
> +
> if (!dma_fence_add_callback(entity->dependency, &entity->cb,
> drm_sched_entity_wakeup))
For this path the same thing, since entity->dependency reference is also
held (taken in drm_sched_job_dependency() and released two lines below
this hunk).
Regards,
Tvrtko
> return true;
> @@ -478,10 +488,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>
> while ((entity->dependency =
> drm_sched_job_dependency(sched_job, entity))) {
> - if (drm_sched_entity_add_dependency_cb(entity)) {
> - trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
> + if (drm_sched_entity_add_dependency_cb(entity, sched_job))
> return NULL;
> - }
> }
>
> /* skip jobs from entity that marked guilty */
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
2025-06-26 8:41 ` Tvrtko Ursulin
@ 2025-06-26 13:43 ` Pierre-Eric Pelloux-Prayer
2025-06-26 14:05 ` Tvrtko Ursulin
0 siblings, 1 reply; 7+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2025-06-26 13:43 UTC (permalink / raw)
To: Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer, Matthew Brost,
Danilo Krummrich, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter
Cc: dri-devel
Hi,
Le 26/06/2025 à 10:41, Tvrtko Ursulin a écrit :
>
> On 25/06/2025 15:37, Pierre-Eric Pelloux-Prayer wrote:
>> The drm_sched_job_unschedulable trace point can access
>> entity->dependency after it was cleared by the callback
>> installed in drm_sched_entity_add_dependency_cb, causing:
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000020
>> Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
>> RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
>>
>> To fix this we either need to take a reference of the fence before
>> setting up the callbacks, or move the trace_drm_sched_job_unschedulable
>> calls into drm_sched_entity_add_dependency_cb where they can be
>> done earlier. The former option is the more correct one because with
>> the latter we might emit the event and still be able to schedule the
>> job if the fence is signaled in-between. Despite that, I've
>> implemented the latter, since it's a bit simpler and the extra event
>> is not a deal breaker for tools anyway.
>>
>> Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
>>
>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 5635b3a826d8..a23b204cac5c 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>> * Add a callback to the current dependency of the entity to wake up the
>> * scheduler when the entity becomes available.
>> */
>> -static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>> +static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
>> + struct drm_sched_job *sched_job)
>> {
>> struct drm_gpu_scheduler *sched = entity->rq->sched;
>> struct dma_fence *fence = entity->dependency;
>> @@ -429,6 +430,11 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>> fence = dma_fence_get(&s_fence->scheduled);
>> dma_fence_put(entity->dependency);
>> entity->dependency = fence;
>> +
>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>> + !dma_fence_is_signaled(fence))
>> + trace_drm_sched_job_unschedulable(sched_job, fence);
>> +
>> if (!dma_fence_add_callback(fence, &entity->cb,
>> drm_sched_entity_clear_dep))
>
> Would placing it here be simpler, like this:
>
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -430,8 +430,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> dma_fence_put(entity->dependency);
> entity->dependency = fence;
> if (!dma_fence_add_callback(fence, &entity->cb,
> - drm_sched_entity_clear_dep))
> + drm_sched_entity_clear_dep)) {
At this point, the callback has been installed which means it can be called.
And when it's called it will drop the reference we had of entity->dependency / fence.
So unless I'm mistaken, the trace event below might still dereference freed memory.
Pierre-Eric
> + trace_drm_sched_job_unschedulable(sched_job, fence);
> return true;
> + }
>
> /* Ignore it when it is already scheduled */
> dma_fence_put(fence);
>
> Both sched_job and fence references are held on this path (effectively for sched_job), so it is safe
> from that point of view. Makes the "is signaled" check not needed because dma_fence_add_callback()
> told us so. And trace_drm_sched_job_unschedulable_enabled() is then not needed too.
>
>> return true;
>> @@ -438,6 +444,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>> return false;
>> }
>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>> + !dma_fence_is_signaled(entity->dependency))
>> + trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
>> +
>> if (!dma_fence_add_callback(entity->dependency, &entity->cb,
>> drm_sched_entity_wakeup))
>
> For this path the same thing, since entity->dependency reference is also held (taken in
> drm_sched_job_dependency() and released two lines below this hunk).
>
> Regards,
>
> Tvrtko
>
>> return true;
>> @@ -478,10 +488,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>> while ((entity->dependency =
>> drm_sched_job_dependency(sched_job, entity))) {
>> - if (drm_sched_entity_add_dependency_cb(entity)) {
>> - trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
>> + if (drm_sched_entity_add_dependency_cb(entity, sched_job))
>> return NULL;
>> - }
>> }
>> /* skip jobs from entity that marked guilty */
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
2025-06-26 13:43 ` Pierre-Eric Pelloux-Prayer
@ 2025-06-26 14:05 ` Tvrtko Ursulin
2025-07-21 15:18 ` Pierre-Eric Pelloux-Prayer
0 siblings, 1 reply; 7+ messages in thread
From: Tvrtko Ursulin @ 2025-06-26 14:05 UTC (permalink / raw)
To: Pierre-Eric Pelloux-Prayer, Pierre-Eric Pelloux-Prayer,
Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter
Cc: dri-devel
On 26/06/2025 14:43, Pierre-Eric Pelloux-Prayer wrote:
> Hi,
>
> Le 26/06/2025 à 10:41, Tvrtko Ursulin a écrit :
>>
>> On 25/06/2025 15:37, Pierre-Eric Pelloux-Prayer wrote:
>>> The drm_sched_job_unschedulable trace point can access
>>> entity->dependency after it was cleared by the callback
>>> installed in drm_sched_entity_add_dependency_cb, causing:
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000020
>>> Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
>>> RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0
>>> [gpu_sched]
>>>
>>> To fix this we either need to take a reference of the fence before
>>> setting up the callbacks, or move the trace_drm_sched_job_unschedulable
>>> calls into drm_sched_entity_add_dependency_cb where they can be
>>> done earlier. The former option is the more correct one because with
>>> the latter we might emit the event and still be able to schedule the
>>> job if the fence is signaled in-between. Despite that, I've
>>> implemented the latter, since it's a bit simpler and the extra event
>>> is not a deal breaker for tools anyway.
>>>
>>> Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
>>>
>>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-
>>> prayer@amd.com>
>>> ---
>>> drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
>>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/
>>> drm/scheduler/sched_entity.c
>>> index 5635b3a826d8..a23b204cac5c 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>>> * Add a callback to the current dependency of the entity to wake
>>> up the
>>> * scheduler when the entity becomes available.
>>> */
>>> -static bool drm_sched_entity_add_dependency_cb(struct
>>> drm_sched_entity *entity)
>>> +static bool drm_sched_entity_add_dependency_cb(struct
>>> drm_sched_entity *entity,
>>> + struct drm_sched_job *sched_job)
>>> {
>>> struct drm_gpu_scheduler *sched = entity->rq->sched;
>>> struct dma_fence *fence = entity->dependency;
>>> @@ -429,6 +430,11 @@ static bool
>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>> fence = dma_fence_get(&s_fence->scheduled);
>>> dma_fence_put(entity->dependency);
>>> entity->dependency = fence;
>>> +
>>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>>> + !dma_fence_is_signaled(fence))
>>> + trace_drm_sched_job_unschedulable(sched_job, fence);
>>> +
>>> if (!dma_fence_add_callback(fence, &entity->cb,
>>> drm_sched_entity_clear_dep))
>>
>> Would placing it here be simpler, like this:
>>
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -430,8 +430,10 @@ static bool
>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>> dma_fence_put(entity->dependency);
>> entity->dependency = fence;
>> if (!dma_fence_add_callback(fence, &entity->cb,
>> - drm_sched_entity_clear_dep))
>> +
>> drm_sched_entity_clear_dep)) {
>
> At this point, the callback has been installed which means it can be
> called.
> And when it's called it will drop the reference we had of entity-
> >dependency / fence.
>
> So unless I'm mistaken, the trace event below might still dereference
> freed memory.
Bah sorry, I got confused.
You just need to fix the formatting of the fixes tag:
76d97c870f29 ("drm/sched: Trace dependencies for GPU jobs")
With that:
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Regards,
Tvrtko
>
> Pierre-Eric
>
>
>> + trace_drm_sched_job_unschedulable(sched_job,
>> fence);
>> return true;
>> + }
>>
>> /* Ignore it when it is already scheduled */
>> dma_fence_put(fence);
>>
>> Both sched_job and fence references are held on this path (effectively
>> for sched_job), so it is safe from that point of view. Makes the "is
>> signaled" check not needed because dma_fence_add_callback() told us
>> so. And trace_drm_sched_job_unschedulable_enabled() is then not needed
>> too.
>>
>>> return true;
>>> @@ -438,6 +444,10 @@ static bool
>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>> return false;
>>> }
>>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>>> + !dma_fence_is_signaled(entity->dependency))
>>> + trace_drm_sched_job_unschedulable(sched_job, entity-
>>> >dependency);
>>> +
>>> if (!dma_fence_add_callback(entity->dependency, &entity->cb,
>>> drm_sched_entity_wakeup))
>>
>> For this path the same thing, since entity->dependency reference is
>> also held (taken in drm_sched_job_dependency() and released two lines
>> below this hunk).
>>
>> Regards,
>>
>> Tvrtko
>>
>>> return true;
>>> @@ -478,10 +488,8 @@ struct drm_sched_job
>>> *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>> while ((entity->dependency =
>>> drm_sched_job_dependency(sched_job, entity))) {
>>> - if (drm_sched_entity_add_dependency_cb(entity)) {
>>> - trace_drm_sched_job_unschedulable(sched_job, entity-
>>> >dependency);
>>> + if (drm_sched_entity_add_dependency_cb(entity, sched_job))
>>> return NULL;
>>> - }
>>> }
>>> /* skip jobs from entity that marked guilty */
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
2025-06-26 14:05 ` Tvrtko Ursulin
@ 2025-07-21 15:18 ` Pierre-Eric Pelloux-Prayer
2025-08-20 9:06 ` Pierre-Eric Pelloux-Prayer
0 siblings, 1 reply; 7+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2025-07-21 15:18 UTC (permalink / raw)
To: Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer, Matthew Brost,
Danilo Krummrich, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter
Cc: dri-devel
Le 26/06/2025 à 16:05, Tvrtko Ursulin a écrit :
>
> On 26/06/2025 14:43, Pierre-Eric Pelloux-Prayer wrote:
>> Hi,
>>
>> Le 26/06/2025 à 10:41, Tvrtko Ursulin a écrit :
>>>
>>> On 25/06/2025 15:37, Pierre-Eric Pelloux-Prayer wrote:
>>>> The drm_sched_job_unschedulable trace point can access
>>>> entity->dependency after it was cleared by the callback
>>>> installed in drm_sched_entity_add_dependency_cb, causing:
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000020
>>>> Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
>>>> RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
>>>>
>>>> To fix this we either need to take a reference of the fence before
>>>> setting up the callbacks, or move the trace_drm_sched_job_unschedulable
>>>> calls into drm_sched_entity_add_dependency_cb where they can be
>>>> done earlier. The former option is the more correct one because with
>>>> the latter we might emit the event and still be able to schedule the
>>>> job if the fence is signaled in-between. Despite that, I've
>>>> implemented the latter, since it's a bit simpler and the extra event
>>>> is not a deal breaker for tools anyway.
>>>>
>>>> Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
>>>>
>>>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux- prayer@amd.com>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
>>>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/ drm/scheduler/sched_entity.c
>>>> index 5635b3a826d8..a23b204cac5c 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>>>> * Add a callback to the current dependency of the entity to wake up the
>>>> * scheduler when the entity becomes available.
>>>> */
>>>> -static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>> +static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
>>>> + struct drm_sched_job *sched_job)
>>>> {
>>>> struct drm_gpu_scheduler *sched = entity->rq->sched;
>>>> struct dma_fence *fence = entity->dependency;
>>>> @@ -429,6 +430,11 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
>>>> *entity)
>>>> fence = dma_fence_get(&s_fence->scheduled);
>>>> dma_fence_put(entity->dependency);
>>>> entity->dependency = fence;
>>>> +
>>>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>>>> + !dma_fence_is_signaled(fence))
>>>> + trace_drm_sched_job_unschedulable(sched_job, fence);
>>>> +
>>>> if (!dma_fence_add_callback(fence, &entity->cb,
>>>> drm_sched_entity_clear_dep))
>>>
>>> Would placing it here be simpler, like this:
>>>
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -430,8 +430,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>> dma_fence_put(entity->dependency);
>>> entity->dependency = fence;
>>> if (!dma_fence_add_callback(fence, &entity->cb,
>>> - drm_sched_entity_clear_dep))
>>> + drm_sched_entity_clear_dep)) {
>>
>> At this point, the callback has been installed which means it can be called.
>> And when it's called it will drop the reference we had of entity- >dependency / fence.
>>
>> So unless I'm mistaken, the trace event below might still dereference freed memory.
>
> Bah sorry, I got confused.
>
> You just need to fix the formatting of the fixes tag:
>
> 76d97c870f29 ("drm/sched: Trace dependencies for GPU jobs")
Do I need to send a v2 just to amend the commit message or can it be fixed up by the maintainer
pushing the patch to the right git tree?
Pierre-Eric
>
> With that:
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Regards,
>
> Tvrtko
>
>>
>> Pierre-Eric
>>
>>
>>> + trace_drm_sched_job_unschedulable(sched_job, fence);
>>> return true;
>>> + }
>>>
>>> /* Ignore it when it is already scheduled */
>>> dma_fence_put(fence);
>>>
>>> Both sched_job and fence references are held on this path (effectively for sched_job), so it is
>>> safe from that point of view. Makes the "is signaled" check not needed because
>>> dma_fence_add_callback() told us so. And trace_drm_sched_job_unschedulable_enabled() is then not
>>> needed too.
>>>
>>>> return true;
>>>> @@ -438,6 +444,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
>>>> *entity)
>>>> return false;
>>>> }
>>>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>>>> + !dma_fence_is_signaled(entity->dependency))
>>>> + trace_drm_sched_job_unschedulable(sched_job, entity- >dependency);
>>>> +
>>>> if (!dma_fence_add_callback(entity->dependency, &entity->cb,
>>>> drm_sched_entity_wakeup))
>>>
>>> For this path the same thing, since entity->dependency reference is also held (taken in
>>> drm_sched_job_dependency() and released two lines below this hunk).
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> return true;
>>>> @@ -478,10 +488,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity
>>>> *entity)
>>>> while ((entity->dependency =
>>>> drm_sched_job_dependency(sched_job, entity))) {
>>>> - if (drm_sched_entity_add_dependency_cb(entity)) {
>>>> - trace_drm_sched_job_unschedulable(sched_job, entity- >dependency);
>>>> + if (drm_sched_entity_add_dependency_cb(entity, sched_job))
>>>> return NULL;
>>>> - }
>>>> }
>>>> /* skip jobs from entity that marked guilty */
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
2025-07-21 15:18 ` Pierre-Eric Pelloux-Prayer
@ 2025-08-20 9:06 ` Pierre-Eric Pelloux-Prayer
2025-08-25 9:59 ` Philipp Stanner
0 siblings, 1 reply; 7+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2025-08-20 9:06 UTC (permalink / raw)
To: Tvrtko Ursulin, Pierre-Eric Pelloux-Prayer, Matthew Brost,
Danilo Krummrich, Philipp Stanner, Christian König,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter
Cc: dri-devel
Le 21/07/2025 à 17:18, Pierre-Eric Pelloux-Prayer a écrit :
>
>
> Le 26/06/2025 à 16:05, Tvrtko Ursulin a écrit :
>>
>> On 26/06/2025 14:43, Pierre-Eric Pelloux-Prayer wrote:
>>> Hi,
>>>
>>> Le 26/06/2025 à 10:41, Tvrtko Ursulin a écrit :
>>>>
>>>> On 25/06/2025 15:37, Pierre-Eric Pelloux-Prayer wrote:
>>>>> The drm_sched_job_unschedulable trace point can access
>>>>> entity->dependency after it was cleared by the callback
>>>>> installed in drm_sched_entity_add_dependency_cb, causing:
>>>>>
>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000020
>>>>> Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
>>>>> RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
>>>>>
>>>>> To fix this we either need to take a reference of the fence before
>>>>> setting up the callbacks, or move the trace_drm_sched_job_unschedulable
>>>>> calls into drm_sched_entity_add_dependency_cb where they can be
>>>>> done earlier. The former option is the more correct one because with
>>>>> the latter we might emit the event and still be able to schedule the
>>>>> job if the fence is signaled in-between. Despite that, I've
>>>>> implemented the latter, since it's a bit simpler and the extra event
>>>>> is not a deal breaker for tools anyway.
>>>>>
>>>>> Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
>>>>>
>>>>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux- prayer@amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
>>>>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/ drm/scheduler/sched_entity.c
>>>>> index 5635b3a826d8..a23b204cac5c 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>>>>> * Add a callback to the current dependency of the entity to wake up the
>>>>> * scheduler when the entity becomes available.
>>>>> */
>>>>> -static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>> +static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
>>>>> + struct drm_sched_job *sched_job)
>>>>> {
>>>>> struct drm_gpu_scheduler *sched = entity->rq->sched;
>>>>> struct dma_fence *fence = entity->dependency;
>>>>> @@ -429,6 +430,11 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
>>>>> *entity)
>>>>> fence = dma_fence_get(&s_fence->scheduled);
>>>>> dma_fence_put(entity->dependency);
>>>>> entity->dependency = fence;
>>>>> +
>>>>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>>>>> + !dma_fence_is_signaled(fence))
>>>>> + trace_drm_sched_job_unschedulable(sched_job, fence);
>>>>> +
>>>>> if (!dma_fence_add_callback(fence, &entity->cb,
>>>>> drm_sched_entity_clear_dep))
>>>>
>>>> Would placing it here be simpler, like this:
>>>>
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -430,8 +430,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
>>>> *entity)
>>>> dma_fence_put(entity->dependency);
>>>> entity->dependency = fence;
>>>> if (!dma_fence_add_callback(fence, &entity->cb,
>>>> - drm_sched_entity_clear_dep))
>>>> + drm_sched_entity_clear_dep)) {
>>>
>>> At this point, the callback has been installed which means it can be called.
>>> And when it's called it will drop the reference we had of entity- >dependency / fence.
>>>
>>> So unless I'm mistaken, the trace event below might still dereference freed memory.
>>
>> Bah sorry, I got confused.
>>
>> You just need to fix the formatting of the fixes tag:
>>
>> 76d97c870f29 ("drm/sched: Trace dependencies for GPU jobs")
>
> Do I need to send a v2 just to amend the commit message or can it be fixed up by the maintainer
> pushing the patch to the right git tree?
>
> Pierre-Eric
ping?
Should this go through drm-misc-fixes as it's a fix for a commit that landed during this release cycle?
Thanks,
Pierre-Eric
>
>>
>> With that:
>>
>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Pierre-Eric
>>>
>>>
>>>> + trace_drm_sched_job_unschedulable(sched_job, fence);
>>>> return true;
>>>> + }
>>>>
>>>> /* Ignore it when it is already scheduled */
>>>> dma_fence_put(fence);
>>>>
>>>> Both sched_job and fence references are held on this path (effectively for sched_job), so it is
>>>> safe from that point of view. Makes the "is signaled" check not needed because
>>>> dma_fence_add_callback() told us so. And trace_drm_sched_job_unschedulable_enabled() is then not
>>>> needed too.
>>>>
>>>>> return true;
>>>>> @@ -438,6 +444,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
>>>>> *entity)
>>>>> return false;
>>>>> }
>>>>> + if (trace_drm_sched_job_unschedulable_enabled() &&
>>>>> + !dma_fence_is_signaled(entity->dependency))
>>>>> + trace_drm_sched_job_unschedulable(sched_job, entity- >dependency);
>>>>> +
>>>>> if (!dma_fence_add_callback(entity->dependency, &entity->cb,
>>>>> drm_sched_entity_wakeup))
>>>>
>>>> For this path the same thing, since entity->dependency reference is also held (taken in
>>>> drm_sched_job_dependency() and released two lines below this hunk).
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> return true;
>>>>> @@ -478,10 +488,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity
>>>>> *entity)
>>>>> while ((entity->dependency =
>>>>> drm_sched_job_dependency(sched_job, entity))) {
>>>>> - if (drm_sched_entity_add_dependency_cb(entity)) {
>>>>> - trace_drm_sched_job_unschedulable(sched_job, entity- >dependency);
>>>>> + if (drm_sched_entity_add_dependency_cb(entity, sched_job))
>>>>> return NULL;
>>>>> - }
>>>>> }
>>>>> /* skip jobs from entity that marked guilty */
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency
2025-08-20 9:06 ` Pierre-Eric Pelloux-Prayer
@ 2025-08-25 9:59 ` Philipp Stanner
0 siblings, 0 replies; 7+ messages in thread
From: Philipp Stanner @ 2025-08-25 9:59 UTC (permalink / raw)
To: Pierre-Eric Pelloux-Prayer, Tvrtko Ursulin,
Pierre-Eric Pelloux-Prayer, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Cc: dri-devel
On Wed, 2025-08-20 at 11:06 +0200, Pierre-Eric Pelloux-Prayer wrote:
>
>
> Le 21/07/2025 à 17:18, Pierre-Eric Pelloux-Prayer a écrit :
> >
> >
> > Le 26/06/2025 à 16:05, Tvrtko Ursulin a écrit :
> > >
> > > On 26/06/2025 14:43, Pierre-Eric Pelloux-Prayer wrote:
> > > > Hi,
> > > >
> > > > Le 26/06/2025 à 10:41, Tvrtko Ursulin a écrit :
> > > > >
> > > > > On 25/06/2025 15:37, Pierre-Eric Pelloux-Prayer wrote:
> > > > > > The drm_sched_job_unschedulable trace point can access
> > > > > > entity->dependency after it was cleared by the callback
> > > > > > installed in drm_sched_entity_add_dependency_cb, causing:
> > > > > >
> > > > > > BUG: kernel NULL pointer dereference, address: 0000000000000020
> > > > > > Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
> > > > > > RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
> > > > > >
> > > > > > To fix this we either need to take a reference of the fence before
> > > > > > setting up the callbacks, or move the trace_drm_sched_job_unschedulable
> > > > > > calls into drm_sched_entity_add_dependency_cb where they can be
> > > > > > done earlier. The former option is the more correct one because with
> > > > > > the latter we might emit the event and still be able to schedule the
> > > > > > job if the fence is signaled in-between. Despite that, I've
> > > > > > implemented the latter, since it's a bit simpler and the extra event
> > > > > > is not a deal breaker for tools anyway.
> > > > > >
> > > > > > Fixes: 76d97c870f29 (drm/sched: Trace dependencies for GPU jobs)
> > > > > >
> > > > > > Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux- prayer@amd.com>
> > > > > > ---
> > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++----
> > > > > > 1 file changed, 12 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/ drm/scheduler/sched_entity.c
> > > > > > index 5635b3a826d8..a23b204cac5c 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > @@ -401,7 +401,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
> > > > > > * Add a callback to the current dependency of the entity to wake up the
> > > > > > * scheduler when the entity becomes available.
> > > > > > */
> > > > > > -static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> > > > > > +static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
> > > > > > + struct drm_sched_job *sched_job)
> > > > > > {
> > > > > > struct drm_gpu_scheduler *sched = entity->rq->sched;
> > > > > > struct dma_fence *fence = entity->dependency;
> > > > > > @@ -429,6 +430,11 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
> > > > > > *entity)
> > > > > > fence = dma_fence_get(&s_fence->scheduled);
> > > > > > dma_fence_put(entity->dependency);
> > > > > > entity->dependency = fence;
> > > > > > +
> > > > > > + if (trace_drm_sched_job_unschedulable_enabled() &&
> > > > > > + !dma_fence_is_signaled(fence))
> > > > > > + trace_drm_sched_job_unschedulable(sched_job, fence);
> > > > > > +
> > > > > > if (!dma_fence_add_callback(fence, &entity->cb,
> > > > > > drm_sched_entity_clear_dep))
> > > > >
> > > > > Would placing it here be simpler, like this:
> > > > >
> > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > @@ -430,8 +430,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
> > > > > *entity)
> > > > > dma_fence_put(entity->dependency);
> > > > > entity->dependency = fence;
> > > > > if (!dma_fence_add_callback(fence, &entity->cb,
> > > > > - drm_sched_entity_clear_dep))
> > > > > + drm_sched_entity_clear_dep)) {
> > > >
> > > > At this point, the callback has been installed which means it can be called.
> > > > And when it's called it will drop the reference we had of entity- >dependency / fence.
> > > >
> > > > So unless I'm mistaken, the trace event below might still dereference freed memory.
> > >
> > > Bah sorry, I got confused.
> > >
> > > You just need to fix the formatting of the fixes tag:
> > >
> > > 76d97c870f29 ("drm/sched: Trace dependencies for GPU jobs")
> >
> > Do I need to send a v2 just to amend the commit message or can it be fixed up by the maintainer
> > pushing the patch to the right git tree?
> >
> > Pierre-Eric
>
> ping?
>
> Should this go through drm-misc-fixes as it's a fix for a commit that landed during this release cycle?
Sry, I was afk a while.
This problem is now already in 6.17-rc2, so the fix must go to drm-
misc-fixes. There is now a merge conflict, I think caused by
15f77764e90a713ee3916ca424757688e4f565b9
Can you rebase on drm-misc-fixes (master should also work) and submit a
v2 with Tvrtko's suggestion for the Fixes: tag?
Thx
P.
>
> Thanks,
> Pierre-Eric
>
>
> >
> > >
> > > With that:
> > >
> > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >
> >
> >
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
> > > >
> > > > Pierre-Eric
> > > >
> > > >
> > > > > + trace_drm_sched_job_unschedulable(sched_job, fence);
> > > > > return true;
> > > > > + }
> > > > >
> > > > > /* Ignore it when it is already scheduled */
> > > > > dma_fence_put(fence);
> > > > >
> > > > > Both sched_job and fence references are held on this path (effectively for sched_job), so it is
> > > > > safe from that point of view. Makes the "is signaled" check not needed because
> > > > > dma_fence_add_callback() told us so. And trace_drm_sched_job_unschedulable_enabled() is then not
> > > > > needed too.
> > > > >
> > > > > > return true;
> > > > > > @@ -438,6 +444,10 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity
> > > > > > *entity)
> > > > > > return false;
> > > > > > }
> > > > > > + if (trace_drm_sched_job_unschedulable_enabled() &&
> > > > > > + !dma_fence_is_signaled(entity->dependency))
> > > > > > + trace_drm_sched_job_unschedulable(sched_job, entity- >dependency);
> > > > > > +
> > > > > > if (!dma_fence_add_callback(entity->dependency, &entity->cb,
> > > > > > drm_sched_entity_wakeup))
> > > > >
> > > > > For this path the same thing, since entity->dependency reference is also held (taken in
> > > > > drm_sched_job_dependency() and released two lines below this hunk).
> > > > >
> > > > > Regards,
> > > > >
> > > > > Tvrtko
> > > > >
> > > > > > return true;
> > > > > > @@ -478,10 +488,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity
> > > > > > *entity)
> > > > > > while ((entity->dependency =
> > > > > > drm_sched_job_dependency(sched_job, entity))) {
> > > > > > - if (drm_sched_entity_add_dependency_cb(entity)) {
> > > > > > - trace_drm_sched_job_unschedulable(sched_job, entity- >dependency);
> > > > > > + if (drm_sched_entity_add_dependency_cb(entity, sched_job))
> > > > > > return NULL;
> > > > > > - }
> > > > > > }
> > > > > > /* skip jobs from entity that marked guilty */
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-08-25 9:59 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25 14:37 [PATCH v1] drm/sched: Fix racy access to drm_sched_entity.dependency Pierre-Eric Pelloux-Prayer
2025-06-26 8:41 ` Tvrtko Ursulin
2025-06-26 13:43 ` Pierre-Eric Pelloux-Prayer
2025-06-26 14:05 ` Tvrtko Ursulin
2025-07-21 15:18 ` Pierre-Eric Pelloux-Prayer
2025-08-20 9:06 ` Pierre-Eric Pelloux-Prayer
2025-08-25 9:59 ` Philipp Stanner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).