* [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
@ 2025-09-22 13:09 Jules Maselbas
2025-09-22 15:30 ` Philipp Stanner
0 siblings, 1 reply; 14+ messages in thread
From: Jules Maselbas @ 2025-09-22 13:09 UTC (permalink / raw)
To: stable
Cc: gregkh, Tvrtko Ursulin, Christian König, Alex Deucher,
Luben Tuikov, Matthew Brost, Philipp Stanner
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
In FIFO mode (which is the default), both drm_sched_entity_push_job() and
drm_sched_rq_update_fifo(), where the former calls the latter, are
currently taking and releasing the same entity->rq_lock.
We can avoid that design inelegance, and also gain a minuscule
efficiency improvement on the submit-from-idle path, by introducing a new
drm_sched_rq_update_fifo_locked() helper and pulling the lock taking up to
its callers.
v2:
* Remove drm_sched_rq_update_fifo() altogether. (Christian)
v3:
* Improved commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin@igalia.com
(cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3)
Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
---
drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
drivers/gpu/drm/scheduler/sched_main.c | 6 +++---
include/drm/gpu_scheduler.h | 2 +-
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 3e75fc1f6607..9dbae7b08bc9 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -505,8 +505,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
struct drm_sched_job *next;
next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
- if (next)
- drm_sched_rq_update_fifo(entity, next->submit_ts);
+ if (next) {
+ spin_lock(&entity->rq_lock);
+ drm_sched_rq_update_fifo_locked(entity,
+ next->submit_ts);
+ spin_unlock(&entity->rq_lock);
+ }
}
/* Jobs and entities might have different lifecycles. Since we're
@@ -606,10 +610,11 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
sched = rq->sched;
drm_sched_rq_add_entity(rq, entity);
- spin_unlock(&entity->rq_lock);
if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
- drm_sched_rq_update_fifo(entity, submit_ts);
+ drm_sched_rq_update_fifo_locked(entity, submit_ts);
+
+ spin_unlock(&entity->rq_lock);
drm_sched_wakeup(sched);
}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 416590ea0dc3..3609d5a8fecd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -169,14 +169,15 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
}
}
-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts)
{
/*
* Both locks need to be grabbed, one to protect from entity->rq change
* for entity from within concurrent drm_sched_entity_select_rq and the
* other to update the rb tree structure.
*/
- spin_lock(&entity->rq_lock);
+ lockdep_assert_held(&entity->rq_lock);
+
spin_lock(&entity->rq->lock);
drm_sched_rq_remove_fifo_locked(entity);
@@ -187,7 +188,6 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
drm_sched_entity_compare_before);
spin_unlock(&entity->rq->lock);
- spin_unlock(&entity->rq_lock);
}
/**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9c437a057e5d..346a3c261b43 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,7 +593,7 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
struct drm_sched_entity *entity);
-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts);
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts);
int drm_sched_entity_init(struct drm_sched_entity *entity,
enum drm_sched_priority priority,
--
2.51.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
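The patch above is the common "pull the lock up to the caller" refactor: the _locked() variant asserts that the lock is held instead of taking it, and the callers take entity->rq_lock once around all the work. A minimal userspace sketch of the pattern (all names are illustrative stand-ins, not the kernel API; a plain flag models the spinlock):

```c
#include <assert.h>

/* Hypothetical stand-ins for entity->rq_lock and the FIFO timestamp. */
static int rq_lock_held;
static long long fifo_ts;

/* _locked variant: the caller must already hold rq_lock.  The kernel
 * code documents the same contract with lockdep_assert_held(). */
static void update_fifo_locked(long long ts)
{
	assert(rq_lock_held);	/* userspace analogue of lockdep_assert_held() */
	fifo_ts = ts;
}

/* The caller takes the lock once and performs all updates under it,
 * instead of each helper locking and unlocking on its own. */
static void push_job(long long ts)
{
	rq_lock_held = 1;	/* spin_lock(&entity->rq_lock) */
	update_fifo_locked(ts);
	rq_lock_held = 0;	/* spin_unlock(&entity->rq_lock) */
}
```

Besides saving a lock/unlock pair, the lockdep assertion turns a caller that forgets the lock into an immediate, loud failure rather than a silent race.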
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-22 13:09 [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Jules Maselbas
@ 2025-09-22 15:30 ` Philipp Stanner
2025-09-22 17:39 ` Christian König
0 siblings, 1 reply; 14+ messages in thread
From: Philipp Stanner @ 2025-09-22 15:30 UTC (permalink / raw)
To: Jules Maselbas, stable
Cc: gregkh, Tvrtko Ursulin, Christian König, Alex Deucher,
Luben Tuikov, Matthew Brost
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
>
> [...]
Am I interpreting this mail correctly: you want to get this patch into
stable?
Why? It doesn't fix a bug.
P.
> [patch body snipped]
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-22 15:30 ` Philipp Stanner
@ 2025-09-22 17:39 ` Christian König
2025-09-22 20:50 ` Jules Maselbas
0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2025-09-22 17:39 UTC (permalink / raw)
To: Philipp Stanner, Jules Maselbas, stable
Cc: gregkh, Tvrtko Ursulin, Alex Deucher, Luben Tuikov, Matthew Brost
On 22.09.25 17:30, Philipp Stanner wrote:
> On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
>>
>> [...]
>
> Am I interpreting this mail correctly: you want to get this patch into
> stable?
>
> Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
Regards,
Christian.
>
>
> P.
>
>> [patch body snipped]
>
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-22 17:39 ` Christian König
@ 2025-09-22 20:50 ` Jules Maselbas
2025-09-23 12:08 ` Philipp Stanner
0 siblings, 1 reply; 14+ messages in thread
From: Jules Maselbas @ 2025-09-22 20:50 UTC (permalink / raw)
To: Christian König, Philipp Stanner, Jules Maselbas, stable
Cc: gregkh, Tvrtko Ursulin, Alex Deucher, Luben Tuikov, Matthew Brost
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
> On 22.09.25 17:30, Philipp Stanner wrote:
>> On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>
>>> commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
>>>
>>> [...]
>>
>> Am I interpreting this mail correctly: you want to get this patch into
>> stable?
>>
>> Why? It doesn't fix a bug.
>
> Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes, patch #3 fixes a freeze in amdgpu.
> We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident about this modification.
Should I send a new version with a modified patch #3?
If so, how should the change be reflected in the commit message?
(I initially asked on #kernelnewbies but ended up pulling in the two other patches.)
Best,
Jules
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-22 20:50 ` Jules Maselbas
@ 2025-09-23 12:08 ` Philipp Stanner
2025-09-23 12:33 ` Christian König
0 siblings, 1 reply; 14+ messages in thread
From: Philipp Stanner @ 2025-09-23 12:08 UTC (permalink / raw)
To: Jules Maselbas, Christian König, stable
Cc: gregkh, Tvrtko Ursulin, Alex Deucher, Luben Tuikov, Matthew Brost
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
> On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
> > On 22.09.25 17:30, Philipp Stanner wrote:
> > > On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
> > > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > >
> > > > commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
> > > >
> > > > [...]
> > >
> > > Am I interpreting this mail correctly: you want to get this patch into
> > > stable?
> > >
> > > Why? It doesn't fix a bug.
> >
> > Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
> Yes patch #3 fixes a freeze in amdgpu
>
> > We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
> I initially modified patch #3 to use .rq_lock instead of .lock, but i didn't felt very confident with this modification.
> Should i sent a new version with a modified patch #3 ?
> If so, how the change should be reflected in the commit message ?
> (I initially ask #kernelnewbies but ended pulling the two other patches)
You know folks, situations like that are why we want to strongly
discourage accessing another API's struct members directly. There is no
API contract for them.
And a proper API function rarely changes its interface, and if it does,
it's easy to find for the contributor where drivers need to be
adjusted. If we were all following that rule, you wouldn't even have to
bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It
knows which entities it has killed. So that information could be
stored in struct amdgpu_vm.
B. Add an API: drm_sched_entity_is_stopped(). There's also
drm_sched_entity_is_idle(), but I guess that won't serve your
purpose?
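Option B amounts to a small accessor so drivers stop dereferencing scheduler internals directly. A hypothetical sketch of what such an API could look like (the struct and function here are simplified stand-ins, not the actual DRM scheduler code):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct drm_sched_entity; the real structure
 * lives in include/drm/gpu_scheduler.h and has many more fields. */
struct sched_entity {
	bool stopped;
};

/* Hypothetical accessor in the spirit of option B: drivers query the
 * flag through one function instead of reading entity->stopped, so the
 * scheduler may later rename the field or change its locking without
 * touching every driver. */
static bool sched_entity_is_stopped(const struct sched_entity *entity)
{
	return entity->stopped;
}
```

With such an accessor as the API contract, a backport like this series would only need the one fix patch, since the driver would never have depended on the lock's name in the first place.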
And btw, as we're at it:
@Christian: Danilo and I recently asked about whether entities can
still outlive their scheduler in amdgpu?
That seems to be the reason why that race-"fix" in drm_sched_fini() was
added, which is the only other place that can mark an entity as
stopped, except for the proper place: drm_sched_entity_kill().
P.
>
> Best,
> Jules
>
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-23 12:08 ` Philipp Stanner
@ 2025-09-23 12:33 ` Christian König
2025-09-23 13:10 ` Philipp Stanner
2025-09-24 11:00 ` Danilo Krummrich
0 siblings, 2 replies; 14+ messages in thread
From: Christian König @ 2025-09-23 12:33 UTC (permalink / raw)
To: Philipp Stanner, Jules Maselbas, stable
Cc: gregkh, Tvrtko Ursulin, Alex Deucher, Luben Tuikov, Matthew Brost
On 23.09.25 14:08, Philipp Stanner wrote:
> On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
>> On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
>>> On 22.09.25 17:30, Philipp Stanner wrote:
>>>> On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>
>>>>> commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
>>>>>
>>>>> [...]
>>>>
>>>> Am I interpreting this mail correctly: you want to get this patch into
>>>> stable?
>>>>
>>>> Why? It doesn't fix a bug.
>>>
>>> Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
>> Yes patch #3 fixes a freeze in amdgpu
>>
>>> We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
>> I initially modified patch #3 to use .rq_lock instead of .lock, but i didn't felt very confident with this modification.
>> Should i sent a new version with a modified patch #3 ?
>> If so, how the change should be reflected in the commit message ?
>> (I initially ask #kernelnewbies but ended pulling the two other patches)
>
> You know folks, situations like that are why we want to strongly
> discourage accessing another API's struct members directly. There is no
> API contract for them.
>
> And a proper API function rarely changes its interface, and if it does,
> it's easy to find for the contributor where drivers need to be
> adjusted. If we were all following that rule, you wouldn't even have to
> bother with patches #1 and #2.
>
> That said, I see two proper solutions for your problem:
>
> A. amdgpu is the one stopping the entities anyways, isn't it? It
> knows which entities it has killed. So that information could be
> stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
Otherwise we would need to re-invent the flush logic for every driver again.
> B. Add an API: drm_sched_entity_is_stopped(). There's also
> drm_sched_entity_is_idle(), but I guess that won't serve your
> purpose?
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
> And btw, as we're at it:
> @Christian: Danilo and I recently asked about whether entities can
> still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
> That seems to be the reason why that race-"fix" in drm_sched_fini() was
> added, which is the only other place that can mark an entity as
> stopped, except for the proper place: drm_sched_entity_kill().
That is potentially still good to have.
Regards,
Christian.
>
>
> P.
>
>>
>> Best,
>> Jules
>>
>
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-23 12:33 ` Christian König
@ 2025-09-23 13:10 ` Philipp Stanner
2025-09-23 13:18 ` Christian König
2025-09-24 11:00 ` Danilo Krummrich
1 sibling, 1 reply; 14+ messages in thread
From: Philipp Stanner @ 2025-09-23 13:10 UTC (permalink / raw)
To: Christian König, Jules Maselbas, stable
Cc: gregkh, Tvrtko Ursulin, Alex Deucher, Luben Tuikov, Matthew Brost
On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
> On 23.09.25 14:08, Philipp Stanner wrote:
> > On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
> > > On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
> > > > On 22.09.25 17:30, Philipp Stanner wrote:
> > > > > On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
> > > > > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > > > >
> > > > > > commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
> > > > > >
> > > > > > [...]
> > > > >
> > > > > Am I interpreting this mail correctly: you want to get this patch into
> > > > > stable?
> > > > >
> > > > > Why? It doesn't fix a bug.
> > > >
> > > > Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
> > > Yes patch #3 fixes a freeze in amdgpu
> > >
> > > > We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
> > > I initially modified patch #3 to use .rq_lock instead of .lock, but i didn't felt very confident with this modification.
> > > Should i sent a new version with a modified patch #3 ?
> > > If so, how the change should be reflected in the commit message ?
> > > (I initially ask #kernelnewbies but ended pulling the two other patches)
> >
> > You know folks, situations like that are why we want to strongly
> > discourage accessing another API's struct members directly. There is no
> > API contract for them.
> >
> > And a proper API function rarely changes its interface, and if it does,
> > it's easy to find for the contributor where drivers need to be
> > adjusted. If we were all following that rule, you wouldn't even have to
> > bother with patches #1 and #2.
> >
> > That said, I see two proper solutions for your problem:
> >
> > A. amdgpu is the one stopping the entities anyways, isn't it? It
> > knows which entities it has killed. So that information could be
> > stored in struct amdgpu_vm.
>
> No, it's the scheduler which decides when entities are stopped.
The scheduler sets the stopped-flag, but that effectively only happens
when you either flush() or fini() the entity. OR if you run into that
drm_sched_fini() race.
>
> Otherwise we would need to re-invent the flush logic for every driver again.
Let's ask differently: Does amdgpu check here whether
drm_sched_entity_fini() or drm_sched_entity_flush() have been called on
those entities already?
>
> > B. Add an API: drm_sched_entity_is_stopped(). There's also
> > drm_sched_entity_is_idle(), but I guess that won't serve your
> > purpose?
>
> drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
Fine by me.
>
> > And btw, as we're at it:
> > @Christian: Danilo and I recently asked about whether entities can
> > still outlive their scheduler in amdgpu?
>
> That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
>
> > That seems to be the reason why that race-"fix" in drm_sched_fini() was
> > added, which is the only other place that can mark an entity as
> > stopped, except for the proper place: drm_sched_entity_kill().
>
> That is potentially still good to have.
That's why we left it for now and just added a FIXME, because there's
not really any benefit in potentially blowing up drivers by removing it
(well, technically blowing up drivers like that would reveal
significant lifetime and, thus, design issues. But it wouldn't be
"nice").
Still, it's a clear sign of (undocumented…) scheduler lifetimes being
violated :(
P.
>
> Regards,
> Christian.
>
> >
> >
> > P.
> >
> > >
> > > Best,
> > > Jules
> > >
> >
>
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-23 13:10 ` Philipp Stanner
@ 2025-09-23 13:18 ` Christian König
0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2025-09-23 13:18 UTC (permalink / raw)
To: Philipp Stanner, Jules Maselbas, stable
Cc: gregkh, Tvrtko Ursulin, Alex Deucher, Luben Tuikov, Matthew Brost
On 23.09.25 15:10, Philipp Stanner wrote:
> On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
>> On 23.09.25 14:08, Philipp Stanner wrote:
>>> On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
>>>> On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
>>>>> On 22.09.25 17:30, Philipp Stanner wrote:
>>>>>> On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>>
>>>>>>> commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
>>>>>>>
>>>>>>> In FIFO mode (which is the default), both drm_sched_entity_push_job() and
>>>>>>> drm_sched_rq_update_fifo(), where the latter calls the former, are
>>>>>>> currently taking and releasing the same entity->rq_lock.
>>>>>>>
>>>>>>> We can avoid that design inelegance, and also have a miniscule
>>>>>>> efficiency improvement on the submit from idle path, by introducing a new
>>>>>>> drm_sched_rq_update_fifo_locked() helper and pulling up the lock taking to
>>>>>>> its callers.
>>>>>>>
>>>>>>> v2:
>>>>>>> * Remove drm_sched_rq_update_fifo() altogether. (Christian)
>>>>>>>
>>>>>>> v3:
>>>>>>> * Improved commit message. (Philipp)
>>>>>>>
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Cc: Philipp Stanner <pstanner@redhat.com>
>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
>>>>>>> Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin@igalia.com
>>>>>>> (cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3)
>>>>>>> Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
>>>>>>
>>>>>> Am I interpreting this mail correctly: you want to get this patch into
>>>>>> stable?
>>>>>>
>>>>>> Why? It doesn't fix a bug.
>>>>>
>>>>> Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
>>>> Yes, patch #3 fixes a freeze in amdgpu.
>>>>
>>>>> We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
>>>> I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with this modification.
>>>> Should I send a new version with a modified patch #3?
>>>> If so, how should the change be reflected in the commit message?
>>>> (I initially asked on #kernelnewbies but ended up pulling in the two other patches.)
>>>
>>> You know folks, situations like that are why we want to strongly
>>> discourage accessing another API's struct members directly. There is no
>>> API contract for them.
>>>
>>> And a proper API function rarely changes its interface, and if it does,
>>> it's easy to find for the contributor where drivers need to be
>>> adjusted. If we were all following that rule, you wouldn't even have to
>>> bother with patches #1 and #2.
>>>
>>> That said, I see two proper solutions for your problem:
>>>
>>> A. amdgpu is the one stopping the entities anyways, isn't it? It
>>> knows which entities it has killed. So that information could be
>>> stored in struct amdgpu_vm.
>>
>> No, it's the scheduler which decides when entities are stopped.
>
> The scheduler sets the stopped-flag, but that effectively only happens
> when you either flush() or fini() the entity. OR if you run into that
> drm_sched_fini() race.
>
>>
>> Otherwise we would need to re-invent the flush logic for every driver again.
>
> Let's ask differently: Does amdgpu check here whether
> drm_sched_entity_fini() or drm_sched_entity_flush() have been called on
> those entities already?
No, and it should not.
drm_sched_entity_flush() can be called many times (and even concurrently) on the same entity.
Only when the scheduler sees that it is called by the last submitter of jobs, *and* that submitter was terminated by a SIGKILL, is the entity killed as well.
The background is that the file flush callback this is used with is called basically all the time. And, as we have now found once more, also because userspace forgot to set CLOEXEC.
>>
>>> B. Add an API: drm_sched_entity_is_stopped(). There's also
>>> drm_sched_entity_is_idle(), but I guess that won't serve your
>>> purpose?
>>
>> drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
>
> Fine by me.
>
>>
>>> And btw, as we're at it:
>>> @Christian: Danilo and I recently asked about whether entities can
>>> still outlive their scheduler in amdgpu?
>>
>> That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
>>
>>> That seems to be the reason why that race-"fix" in drm_sched_fini() was
>>> added, which is the only other place that can mark an entity as
>>> stopped, except for the proper place: drm_sched_entity_kill().
>>
>> That is potentially still good to have.
>
> That's why we left it for now and just added a FIXME, because there's
> not really any benefit in potentially blowing up drivers by removing it
> (well, technically blowing up drivers like that would reveal
> significant lifetime and, thus, design issues. But it wouldn't be
> "nice").
>
> Still, it's a clear sign of (undocumented…) scheduler lifetimes being
> violated :(
Yeah, I know :(
Regards,
Christian.
>
>
> P.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>
>>> P.
>>>
>>>>
>>>> Best,
>>>> Jules
>>>>
>>>
>>
>
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-23 12:33 ` Christian König
2025-09-23 13:10 ` Philipp Stanner
@ 2025-09-24 11:00 ` Danilo Krummrich
2025-09-24 12:22 ` Christian König
1 sibling, 1 reply; 14+ messages in thread
From: Danilo Krummrich @ 2025-09-24 11:00 UTC (permalink / raw)
To: Christian König
Cc: Philipp Stanner, Jules Maselbas, stable, gregkh, Tvrtko Ursulin,
Alex Deucher, Luben Tuikov, Matthew Brost
On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
> On 23.09.25 14:08, Philipp Stanner wrote:
>> You know folks, situations like that are why we want to strongly
>> discourage accessing another API's struct members directly. There is no
>> API contract for them.
Indeed, please don't peek API internals. If you need additional functionality,
please send a patch adding a supported API for the component instead.
Drivers messing with component internals makes it impossible to maintain them in
the long term.
>> And a proper API function rarely changes its interface, and if it does,
>> it's easy to find for the contributor where drivers need to be
>> adjusted. If we were all following that rule, you wouldn't even have to
>> bother with patches #1 and #2.
>>
>> That said, I see two proper solutions for your problem:
>>
>> A. amdgpu is the one stopping the entities anyways, isn't it? It
>> knows which entities it has killed. So that information could be
>> stored in struct amdgpu_vm.
>
> No, it's the scheduler which decides when entities are stopped.
Can you please show me the code where the scheduler calls any of
drm_sched_entity_fini(), drm_sched_entity_flush(), drm_sched_entity_destroy()?
Or are you referring to the broken hack in drm_sched_fini() (introduced by commit
c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")) where it is
just ignored that we need to take the entity lock as well, because it
inconveniently would lead to lock inversion?
spin_lock(&rq->lock);
list_for_each_entry(s_entity, &rq->entities, list)
/*
* Prevents reinsertion and marks job_queue as idle,
* it will be removed from the rq in drm_sched_entity_fini()
* eventually
*/
s_entity->stopped = true;
spin_unlock(&rq->lock);
The patch description that introduced the hack says:
If scheduler is already stopped by the time sched_entity
is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush.
But this sounds to me as if amdgpu simply doesn't implement the correct shutdown
ordering. Why don't nouveau, Xe and other drivers have this problem? Why do
we need to solve it in the scheduler instead?
Maybe there are reasonable answers to that. And assuming there are, it still
isn't a justification for building on top of a broken workaround. :(
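The inversion being described can be sketched with a userspace analog (pthread mutexes standing in for the kernel spinlocks; the struct and function names here are illustrative only, not the scheduler's real API):

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative userspace stand-ins for the kernel structures. */
struct rq {
    pthread_mutex_t lock;          /* rq->lock */
};

struct entity {
    pthread_mutex_t lock;          /* entity->lock (formerly rq_lock) */
    struct rq *rq;
    bool stopped;
};

/* Proper ordering, as in drm_sched_entity_kill(): the entity lock is
 * taken first, with the run-queue lock nested inside it. */
void entity_kill(struct entity *e)
{
    pthread_mutex_lock(&e->lock);
    e->stopped = true;
    pthread_mutex_lock(&e->rq->lock);
    /* ... drm_sched_rq_remove_entity() would run here ... */
    pthread_mutex_unlock(&e->rq->lock);
    pthread_mutex_unlock(&e->lock);
}

/* The drm_sched_fini() hack walks the entities under rq->lock alone;
 * also taking e->lock inside it would nest the two locks in the
 * opposite order to entity_kill() and risk deadlock -- which is why
 * the flag is set without the entity lock, i.e. the data race under
 * discussion. */
void sched_fini_hack(struct entity *e)
{
    pthread_mutex_lock(&e->rq->lock);
    e->stopped = true;             /* written without e->lock */
    pthread_mutex_unlock(&e->rq->lock);
}
```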
* Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-09-24 11:00 ` Danilo Krummrich
@ 2025-09-24 12:22 ` Christian König
0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2025-09-24 12:22 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Philipp Stanner, Jules Maselbas, stable, gregkh, Tvrtko Ursulin,
Alex Deucher, Luben Tuikov, Matthew Brost
On 24.09.25 13:00, Danilo Krummrich wrote:
> On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
>> On 23.09.25 14:08, Philipp Stanner wrote:
>>> You know folks, situations like that are why we want to strongly
>>> discourage accessing another API's struct members directly. There is no
>>> API contract for them.
>
> Indeed, please don't peek API internals. If you need additional functionality,
> please send a patch adding a supported API for the component instead.
>
> Drivers messing with component internals makes it impossible to maintain them in
> the long term.
>
>>> And a proper API function rarely changes its interface, and if it does,
>>> it's easy to find for the contributor where drivers need to be
>>> adjusted. If we were all following that rule, you wouldn't even have to
>>> bother with patches #1 and #2.
>>>
>>> That said, I see two proper solutions for your problem:
>>>
>>> A. amdgpu is the one stopping the entities anyways, isn't it? It
>>> knows which entities it has killed. So that information could be
>>> stored in struct amdgpu_vm.
>>
>> No, it's the scheduler which decides when entities are stopped.
>
> Can you please show me the code where the scheduler calls any of
> drm_sched_entity_fini(), drm_sched_entity_flush(), drm_sched_entity_destroy()?
It's this code here in drm_sched_entity_flush() which decides if an entity should be killed or not:
/* For a killed process disallow further enqueueing of jobs. */
last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
if ((!last_user || last_user == current->group_leader) &&
(current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
drm_sched_entity_kill(entity);
Regards,
Christian.
>
> Or are you referring to the broken hack in drm_sched_fini() (introduced by commit
> c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")) where it is
> just ignored that we need to take the entity lock as well, because it
> inconveniently would lead to lock inversion?
>
> spin_lock(&rq->lock);
> list_for_each_entry(s_entity, &rq->entities, list)
> /*
> * Prevents reinsertion and marks job_queue as idle,
> * it will be removed from the rq in drm_sched_entity_fini()
> * eventually
> */
> s_entity->stopped = true;
> spin_unlock(&rq->lock);
>
> The patch description that introduced the hack says:
>
> If scheduler is already stopped by the time sched_entity
> is released and entity's job_queue not empty I encountred
> a hang in drm_sched_entity_flush.
>
> But this sounds to me as if amdgpu simply doesn't implement the correct shutdown
> ordering. Why don't nouveau, Xe and other drivers have this problem? Why do
> we need to solve it in the scheduler instead?
>
> Maybe there are reasonable answers to that. And assuming there are, it still
> isn't a justification for building on top of a broken workaround. :(
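The last-user logic quoted above can be modeled in userspace with C11 atomics (this is an analog of the kernel snippet, not the real API; the task pointer and the SIGKILL condition are simplified stand-ins):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Analog of the last_user cmpxchg in drm_sched_entity_flush(): the
 * entity is killed only if the flushing task was the last submitter
 * (or nobody was) AND that task is exiting due to SIGKILL. */
struct entity {
    _Atomic(void *) last_user;
    bool killed;
};

void entity_flush(struct entity *e, void *task, bool exiting_on_sigkill)
{
    void *expected = task;

    /* cmpxchg(&e->last_user, task, NULL): on failure, 'expected' is
     * updated to the old value, mirroring cmpxchg's return value. */
    bool was_last = atomic_compare_exchange_strong(&e->last_user,
                                                   &expected, NULL);
    if ((was_last || expected == NULL) && exiting_on_sigkill)
        e->killed = true;          /* drm_sched_entity_kill() */
}
```

This is why flush may be called many times harmlessly: any caller that is not the last submitter (or not dying from SIGKILL) falls through without killing the entity.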
* FAILED: patch "[PATCH] drm/sched: Fix race in drm_sched_entity_select_rq()" failed to apply to 6.12-stable tree
@ 2025-11-03 0:50 gregkh
2025-11-03 12:44 ` [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Sasha Levin
0 siblings, 1 reply; 14+ messages in thread
From: gregkh @ 2025-11-03 0:50 UTC (permalink / raw)
To: phasta, tvrtko.ursulin; +Cc: stable
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x d25e3a610bae03bffc5c14b5d944a5d0cd844678
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2025110342-exhume-mankind-5952@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d25e3a610bae03bffc5c14b5d944a5d0cd844678 Mon Sep 17 00:00:00 2001
From: Philipp Stanner <phasta@kernel.org>
Date: Wed, 22 Oct 2025 08:34:03 +0200
Subject: [PATCH] drm/sched: Fix race in drm_sched_entity_select_rq()
In a past bug fix it was forgotten that entity access must be protected
by the entity lock. That's a data race and potentially UB.
Move the spin_unlock() to the appropriate position.
Cc: stable@vger.kernel.org # v5.13+
Fixes: ac4eb83ab255 ("drm/sched: select new rq even if there is only one v3")
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patch.msgid.link/20251022063402.87318-2-phasta@kernel.org
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 5a4697f636f2..aa222166de58 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -552,10 +552,11 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
drm_sched_rq_remove_entity(entity->rq, entity);
entity->rq = rq;
}
- spin_unlock(&entity->lock);
if (entity->num_sched_list == 1)
entity->sched_list = NULL;
+
+ spin_unlock(&entity->lock);
}
/**
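The shape of the fix is simply to keep every access to the fields the entity lock protects inside the critical section; a minimal userspace analog (a pthread mutex in place of the spinlock, invented names, and the run-queue switch elided):

```c
#include <pthread.h>
#include <stddef.h>

/* Sketch of drm_sched_entity_select_rq() after the fix: sched_list
 * and num_sched_list are protected by entity->lock, so the unlock
 * must come after the last access to them. */
struct entity {
    pthread_mutex_t lock;
    void **sched_list;
    unsigned int num_sched_list;
};

void select_rq(struct entity *e)
{
    pthread_mutex_lock(&e->lock);

    /* ... pick the best scheduler and switch e->rq here ... */

    if (e->num_sched_list == 1)
        e->sched_list = NULL;

    pthread_mutex_unlock(&e->lock);  /* moved below the last access */
}
```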
* [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
2025-11-03 0:50 FAILED: patch "[PATCH] drm/sched: Fix race in drm_sched_entity_select_rq()" failed to apply to 6.12-stable tree gregkh
@ 2025-11-03 12:44 ` Sasha Levin
2025-11-03 12:44 ` [PATCH 6.12.y 2/3] drm/sched: Re-group and rename the entity run-queue lock Sasha Levin
2025-11-03 12:44 ` [PATCH 6.12.y 3/3] drm/sched: Fix race in drm_sched_entity_select_rq() Sasha Levin
0 siblings, 2 replies; 14+ messages in thread
From: Sasha Levin @ 2025-11-03 12:44 UTC (permalink / raw)
To: stable
Cc: Tvrtko Ursulin, Christian König, Alex Deucher, Luben Tuikov,
Matthew Brost, Philipp Stanner, Sasha Levin
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
[ Upstream commit d42a254633c773921884a19e8a1a0f53a31150c3 ]
In FIFO mode (which is the default), both drm_sched_entity_push_job() and
drm_sched_rq_update_fifo(), where the latter calls the former, are
currently taking and releasing the same entity->rq_lock.
We can avoid that design inelegance, and also have a miniscule
efficiency improvement on the submit from idle path, by introducing a new
drm_sched_rq_update_fifo_locked() helper and pulling up the lock taking to
its callers.
v2:
* Remove drm_sched_rq_update_fifo() altogether. (Christian)
v3:
* Improved commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin@igalia.com
Stable-dep-of: d25e3a610bae ("drm/sched: Fix race in drm_sched_entity_select_rq()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
drivers/gpu/drm/scheduler/sched_main.c | 6 +++---
include/drm/gpu_scheduler.h | 2 +-
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 3e75fc1f66072..9dbae7b08bc90 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -505,8 +505,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
struct drm_sched_job *next;
next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
- if (next)
- drm_sched_rq_update_fifo(entity, next->submit_ts);
+ if (next) {
+ spin_lock(&entity->rq_lock);
+ drm_sched_rq_update_fifo_locked(entity,
+ next->submit_ts);
+ spin_unlock(&entity->rq_lock);
+ }
}
/* Jobs and entities might have different lifecycles. Since we're
@@ -606,10 +610,11 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
sched = rq->sched;
drm_sched_rq_add_entity(rq, entity);
- spin_unlock(&entity->rq_lock);
if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
- drm_sched_rq_update_fifo(entity, submit_ts);
+ drm_sched_rq_update_fifo_locked(entity, submit_ts);
+
+ spin_unlock(&entity->rq_lock);
drm_sched_wakeup(sched);
}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index d5260cb1ed0ec..0b7976c908dde 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -169,14 +169,15 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
}
}
-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts)
{
/*
* Both locks need to be grabbed, one to protect from entity->rq change
* for entity from within concurrent drm_sched_entity_select_rq and the
* other to update the rb tree structure.
*/
- spin_lock(&entity->rq_lock);
+ lockdep_assert_held(&entity->rq_lock);
+
spin_lock(&entity->rq->lock);
drm_sched_rq_remove_fifo_locked(entity);
@@ -187,7 +188,6 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
drm_sched_entity_compare_before);
spin_unlock(&entity->rq->lock);
- spin_unlock(&entity->rq_lock);
}
/**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9c437a057e5de..346a3c261b437 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,7 +593,7 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
struct drm_sched_entity *entity);
-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts);
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts);
int drm_sched_entity_init(struct drm_sched_entity *entity,
enum drm_sched_priority priority,
--
2.51.0
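The `_locked` helper convention the patch follows can be illustrated with a small userspace sketch (pthread mutexes in place of spinlocks, an assert standing in for lockdep_assert_held(); all names here are made up for illustration):

```c
#include <assert.h>
#include <pthread.h>

/* Sketch of the "_locked helper" pattern: the caller takes the
 * entity lock once, and the helper merely asserts that it is held
 * instead of taking and releasing it a second time. */
static pthread_mutex_t entity_lock = PTHREAD_MUTEX_INITIALIZER;
static long fifo_ts;

static void rq_update_fifo_locked(long ts)
{
    /* Stand-in for lockdep_assert_held(&entity->lock): trylock on a
     * held non-recursive mutex returns EBUSY, i.e. non-zero. */
    assert(pthread_mutex_trylock(&entity_lock) != 0);
    fifo_ts = ts;
}

long push_job(long ts)
{
    pthread_mutex_lock(&entity_lock);
    /* ... drm_sched_rq_add_entity() would run here ... */
    rq_update_fifo_locked(ts);     /* no second lock/unlock cycle */
    pthread_mutex_unlock(&entity_lock);
    return fifo_ts;
}
```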
* [PATCH 6.12.y 2/3] drm/sched: Re-group and rename the entity run-queue lock
2025-11-03 12:44 ` [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Sasha Levin
@ 2025-11-03 12:44 ` Sasha Levin
2025-11-03 12:44 ` [PATCH 6.12.y 3/3] drm/sched: Fix race in drm_sched_entity_select_rq() Sasha Levin
1 sibling, 0 replies; 14+ messages in thread
From: Sasha Levin @ 2025-11-03 12:44 UTC (permalink / raw)
To: stable
Cc: Tvrtko Ursulin, Christian König, Alex Deucher, Luben Tuikov,
Matthew Brost, Philipp Stanner, Sasha Levin
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
[ Upstream commit f93126f5d55920d1447ef00a3fbe6706f40f53de ]
When writing to a drm_sched_entity's run-queue, writers are protected
through the lock drm_sched_entity.rq_lock. This naming, however,
frequently collides with the separate internal lock of struct
drm_sched_rq, resulting in uses like this:
spin_lock(&entity->rq_lock);
spin_lock(&entity->rq->lock);
Rename drm_sched_entity.rq_lock to improve readability. While at it,
re-order that struct's members to make it more obvious what the lock
protects.
v2:
* Rename some rq_lock straddlers in kerneldoc, improve commit text. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
[pstanner: Fix typo in docstring]
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-5-tursulin@igalia.com
Stable-dep-of: d25e3a610bae ("drm/sched: Fix race in drm_sched_entity_select_rq()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/gpu/drm/scheduler/sched_entity.c | 28 ++++++++++++------------
drivers/gpu/drm/scheduler/sched_main.c | 2 +-
include/drm/gpu_scheduler.h | 21 +++++++++---------
3 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 9dbae7b08bc90..089e8ba0435b8 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -105,7 +105,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
/* We start in an idle state. */
complete_all(&entity->entity_idle);
- spin_lock_init(&entity->rq_lock);
+ spin_lock_init(&entity->lock);
spsc_queue_init(&entity->job_queue);
atomic_set(&entity->fence_seq, 0);
@@ -133,10 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
{
WARN_ON(!num_sched_list || !sched_list);
- spin_lock(&entity->rq_lock);
+ spin_lock(&entity->lock);
entity->sched_list = sched_list;
entity->num_sched_list = num_sched_list;
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
}
EXPORT_SYMBOL(drm_sched_entity_modify_sched);
@@ -245,10 +245,10 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
if (!entity->rq)
return;
- spin_lock(&entity->rq_lock);
+ spin_lock(&entity->lock);
entity->stopped = true;
drm_sched_rq_remove_entity(entity->rq, entity);
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
/* Make sure this entity is not used by the scheduler at the moment */
wait_for_completion(&entity->entity_idle);
@@ -394,9 +394,9 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
enum drm_sched_priority priority)
{
- spin_lock(&entity->rq_lock);
+ spin_lock(&entity->lock);
entity->priority = priority;
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
}
EXPORT_SYMBOL(drm_sched_entity_set_priority);
@@ -506,10 +506,10 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
if (next) {
- spin_lock(&entity->rq_lock);
+ spin_lock(&entity->lock);
drm_sched_rq_update_fifo_locked(entity,
next->submit_ts);
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
}
}
@@ -550,14 +550,14 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
if (fence && !dma_fence_is_signaled(fence))
return;
- spin_lock(&entity->rq_lock);
+ spin_lock(&entity->lock);
sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
rq = sched ? sched->sched_rq[entity->priority] : NULL;
if (rq != entity->rq) {
drm_sched_rq_remove_entity(entity->rq, entity);
entity->rq = rq;
}
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
if (entity->num_sched_list == 1)
entity->sched_list = NULL;
@@ -598,9 +598,9 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
struct drm_sched_rq *rq;
/* Add the entity to the run queue */
- spin_lock(&entity->rq_lock);
+ spin_lock(&entity->lock);
if (entity->stopped) {
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
DRM_ERROR("Trying to push to a killed entity\n");
return;
@@ -614,7 +614,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
drm_sched_rq_update_fifo_locked(entity, submit_ts);
- spin_unlock(&entity->rq_lock);
+ spin_unlock(&entity->lock);
drm_sched_wakeup(sched);
}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 0b7976c908dde..4dde0dc525ce5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -176,7 +176,7 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts
* for entity from within concurrent drm_sched_entity_select_rq and the
* other to update the rb tree structure.
*/
- lockdep_assert_held(&entity->rq_lock);
+ lockdep_assert_held(&entity->lock);
spin_lock(&entity->rq->lock);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 346a3c261b437..e78adc7a91951 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -96,14 +96,22 @@ struct drm_sched_entity {
*/
struct list_head list;
+ /**
+ * @lock:
+ *
+ * Lock protecting the run-queue (@rq) to which this entity belongs,
+ * @priority and the list of schedulers (@sched_list, @num_sched_list).
+ */
+ spinlock_t lock;
+
/**
* @rq:
*
* Runqueue on which this entity is currently scheduled.
*
* FIXME: Locking is very unclear for this. Writers are protected by
- * @rq_lock, but readers are generally lockless and seem to just race
- * with not even a READ_ONCE.
+ * @lock, but readers are generally lockless and seem to just race with
+ * not even a READ_ONCE.
*/
struct drm_sched_rq *rq;
@@ -136,17 +144,10 @@ struct drm_sched_entity {
* @priority:
*
* Priority of the entity. This can be modified by calling
- * drm_sched_entity_set_priority(). Protected by &rq_lock.
+ * drm_sched_entity_set_priority(). Protected by @lock.
*/
enum drm_sched_priority priority;
- /**
- * @rq_lock:
- *
- * Lock to modify the runqueue to which this entity belongs.
- */
- spinlock_t rq_lock;
-
/**
* @job_queue: the list of jobs of this entity.
*/
--
2.51.0
* [PATCH 6.12.y 3/3] drm/sched: Fix race in drm_sched_entity_select_rq()
2025-11-03 12:44 ` [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Sasha Levin
2025-11-03 12:44 ` [PATCH 6.12.y 2/3] drm/sched: Re-group and rename the entity run-queue lock Sasha Levin
@ 2025-11-03 12:44 ` Sasha Levin
1 sibling, 0 replies; 14+ messages in thread
From: Sasha Levin @ 2025-11-03 12:44 UTC (permalink / raw)
To: stable; +Cc: Philipp Stanner, Tvrtko Ursulin, Sasha Levin
From: Philipp Stanner <phasta@kernel.org>
[ Upstream commit d25e3a610bae03bffc5c14b5d944a5d0cd844678 ]
In a past bug fix it was forgotten that entity access must be protected
by the entity lock. That's a data race and potentially UB.
Move the spin_unlock() to the appropriate position.
Cc: stable@vger.kernel.org # v5.13+
Fixes: ac4eb83ab255 ("drm/sched: select new rq even if there is only one v3")
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patch.msgid.link/20251022063402.87318-2-phasta@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/gpu/drm/scheduler/sched_entity.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 089e8ba0435b8..f5b5729433cbb 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -557,10 +557,11 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
drm_sched_rq_remove_entity(entity->rq, entity);
entity->rq = rq;
}
- spin_unlock(&entity->lock);
if (entity->num_sched_list == 1)
entity->sched_list = NULL;
+
+ spin_unlock(&entity->lock);
}
/**
--
2.51.0
end of thread, other threads:[~2025-11-03 12:44 UTC | newest]
Thread overview: 14+ messages
2025-11-03 0:50 FAILED: patch "[PATCH] drm/sched: Fix race in drm_sched_entity_select_rq()" failed to apply to 6.12-stable tree gregkh
2025-11-03 12:44 ` [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Sasha Levin
2025-11-03 12:44 ` [PATCH 6.12.y 2/3] drm/sched: Re-group and rename the entity run-queue lock Sasha Levin
2025-11-03 12:44 ` [PATCH 6.12.y 3/3] drm/sched: Fix race in drm_sched_entity_select_rq() Sasha Levin
-- strict thread matches above, loose matches on Subject: below --
2025-09-22 13:09 [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Jules Maselbas
2025-09-22 15:30 ` Philipp Stanner
2025-09-22 17:39 ` Christian König
2025-09-22 20:50 ` Jules Maselbas
2025-09-23 12:08 ` Philipp Stanner
2025-09-23 12:33 ` Christian König
2025-09-23 13:10 ` Philipp Stanner
2025-09-23 13:18 ` Christian König
2025-09-24 11:00 ` Danilo Krummrich
2025-09-24 12:22 ` Christian König