From: "Danilo Krummrich" <dakr@kernel.org>
To: "Christian König" <christian.koenig@amd.com>
Cc: "Philipp Stanner" <pstanner@redhat.com>,
"Jules Maselbas" <jmaselbas@zdiv.net>, <stable@vger.kernel.org>,
<gregkh@linuxfoundation.org>,
"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Luben Tuikov" <ltuikov89@gmail.com>,
"Matthew Brost" <matthew.brost@intel.com>
Subject: Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job
Date: Wed, 24 Sep 2025 13:00:12 +0200 [thread overview]
Message-ID: <DD0Z8GX3Z56G.3VLLBSJUG05WK@kernel.org> (raw)
In-Reply-To: <76c94ee6-ba28-4517-8b6c-35658ac95d3b@amd.com>
On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
> On 23.09.25 14:08, Philipp Stanner wrote:
>> You know folks, situations like that are why we want to strongly
>> discourage accessing another API's struct members directly. There is no
>> API contract for them.
Indeed, please don't peek API internals. If you need additional functionality,
please send a patch adding a supported API for the component instead.
Drivers messing with component internals makes impossible to maintain them in
the long term.
>> And a proper API function rarely changes its interface, and if it does,
>> it's easy to find for the contributor where drivers need to be
>> adjusted. If we were all following that rule, you wouldn't even have to
>> bother with patches #1 and #2.
>>
>> That said, I see two proper solutions for your problem:
>>
>> A. amdgpu is the one stopping the entities anyways, isn't it? It
>> knows which entities it has killed. So that information could be
>> stored in struct amdgpu_vm.
>
> No, it's the scheduler which decides when entities are stopped.
Can you please show me the code where the scheduler calls any of
drm_sched_entity_fini(), drm_sched_entity_flush(), drm_sched_entity_destroy()?
Or are you referring the broken hack in drm_sched_fini() (introduced by commit
c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")) where it is
just ignored that we need to take the entity lock as well, because it
inconviniently would lead to lock inversion?
spin_lock(&rq->lock);
list_for_each_entry(s_entity, &rq->entities, list)
/*
* Prevents reinsertion and marks job_queue as idle,
* it will be removed from the rq in drm_sched_entity_fini()
* eventually
*/
s_entity->stopped = true;
spin_unlock(&rq->lock);
The patch description that introduced the hack says:
If scheduler is already stopped by the time sched_entity
is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush.
But this sounds to me as if amdgpu simply doesn't implement the correct shutdown
ordering. Why do nouveau, Xe and other drivers don't have this problem? Why do
we need to solve it in the scheduler instead?
Maybe there are reasonable answers to that. And assuming there are, it still
isn't a justification for building on top of a broken workaround. :(
next prev parent reply other threads:[~2025-09-24 11:00 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-22 13:09 [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Jules Maselbas
2025-09-22 13:09 ` [PATCH 6.12.y 2/3] drm/sched: Re-group and rename the entity run-queue lock Jules Maselbas
2025-09-22 13:09 ` [PATCH 6.12.y 3/3] drm/amdgpu: fix task hang from failed job submission during process kill Jules Maselbas
2025-09-22 15:30 ` [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Philipp Stanner
2025-09-22 17:39 ` Christian König
2025-09-22 20:50 ` Jules Maselbas
2025-09-23 12:08 ` Philipp Stanner
2025-09-23 12:33 ` Christian König
2025-09-23 13:10 ` Philipp Stanner
2025-09-23 13:18 ` Christian König
2025-09-24 11:00 ` Danilo Krummrich [this message]
2025-09-24 12:22 ` Christian König
-- strict thread matches above, loose matches on Subject: below --
2025-11-03 0:50 FAILED: patch "[PATCH] drm/sched: Fix race in drm_sched_entity_select_rq()" failed to apply to 6.12-stable tree gregkh
2025-11-03 12:44 ` [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DD0Z8GX3Z56G.3VLLBSJUG05WK@kernel.org \
--to=dakr@kernel.org \
--cc=alexander.deucher@amd.com \
--cc=christian.koenig@amd.com \
--cc=gregkh@linuxfoundation.org \
--cc=jmaselbas@zdiv.net \
--cc=ltuikov89@gmail.com \
--cc=matthew.brost@intel.com \
--cc=pstanner@redhat.com \
--cc=stable@vger.kernel.org \
--cc=tvrtko.ursulin@igalia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox