* [PATCH 1/3] drm/sched: Remove out of place resubmit docu
@ 2025-10-17 13:47 Philipp Stanner
From: Philipp Stanner @ 2025-10-17 13:47 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, David Airlie, Simona Vetter, Sumit Semwal
Cc: dri-devel, linux-kernel, linux-media
The documentation for drm_sched_backend_ops.run_job() states that the
callback can be invoked multiple times by the deprecated function
drm_sched_resubmit_jobs(). It also contains an unresolved TODO.
It is not useful to document side effects of a different, deprecated
function in the docu of run_job(): Existing users won't re-evaluate
their usage of the deprecated function by reading the non-deprecated
one, and new users must not use the deprecated function in the first
place.
Remove the out of place documentation.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
include/drm/gpu_scheduler.h | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index fb88301b3c45..9c629bbc0684 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -429,16 +429,6 @@ struct drm_sched_backend_ops {
*
* @sched_job: the job to run
*
- * The deprecated drm_sched_resubmit_jobs() (called by &struct
- * drm_sched_backend_ops.timedout_job) can invoke this again with the
- * same parameters. Using this is discouraged because it violates
- * dma_fence rules, notably dma_fence_init() has to be called on
- * already initialized fences for a second time. Moreover, this is
- * dangerous because attempts to allocate memory might deadlock with
- * memory management code waiting for the reset to complete.
- *
- * TODO: Document what drivers should do / use instead.
- *
* This method is called in a workqueue context - either from the
* submit_wq the driver passed through drm_sched_init(), or, if the
* driver passed NULL, a separate, ordered workqueue the scheduler
--
2.49.0
* [PATCH 2/3] drm/sched: Add TODO file with first entry
From: Philipp Stanner @ 2025-10-17 13:47 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, David Airlie, Simona Vetter, Sumit Semwal
Cc: dri-devel, linux-kernel, linux-media
Add a drm_sched TODO file with open tasks, contact info, difficulty
level and a job description.
Add the missing successor of drm_sched_resubmit_jobs() as a first task.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/TODO | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
create mode 100644 drivers/gpu/drm/scheduler/TODO
diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
new file mode 100644
index 000000000000..6a06e2858dd6
--- /dev/null
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -0,0 +1,27 @@
+=== drm_sched TODO list ===
+
+* GPU job resubmits
+ - Difficulty: hard
+ - Contact:
+ - Christian König <ckoenig.leichtzumerken@gmail.com>
+ - Philipp Stanner <phasta@kernel.org>
+ - Description:
+ drm_sched_resubmit_jobs() is deprecated. Main reason being that it leads to
+ reinitializing dma_fences. See that function's docu for details. The better
+ approach for valid resubmissions by amdgpu and Xe is (apparently) to figure
+ out which job (and, through association: which entity) caused the hang. Then,
+ the job's buffer data, together with all other jobs' buffer data currently
+ in the same hardware ring, must be invalidated. This can for example be done
+ by overwriting it.
+ amdgpu currently determines which jobs are in the ring and need to be
+ overwritten by keeping copies of the job. Xe obtains that information by
+ directly accessing drm_sched's pending_list.
+ - Tasks:
+ 1. implement scheduler functionality through which
+ the driver can obtain the information which *broken* jobs are currently in
+ the hardware ring.
+ 2. Such infrastructure would then typically be used in
+ drm_sched_backend_ops.timedout_job(). Document that.
+ 3. Port a driver as first user.
+ 3. Document the new alternative in the docu of deprecated
+ drm_sched_resubmit_jobs().
--
2.49.0
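Task 1 of the TODO entry above asks for a way to find out which jobs of a
hung entity are still in the hardware ring. A minimal user-space sketch of
that idea follows. It assumes a simplified, singly linked pending list;
struct entity, struct job and collect_guilty_jobs() are hypothetical names
used for illustration only and are not part of drm_sched.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real drm_sched types. */
struct entity {
	int id;
};

struct job {
	struct entity *entity;	/* entity that submitted this job */
	struct job *next;	/* next job in the pending list */
};

/*
 * Walk the pending list and store up to @max jobs that were submitted by
 * @guilty in @out, in list order. Returns the number of jobs stored.
 */
size_t collect_guilty_jobs(struct job *pending, const struct entity *guilty,
			   struct job **out, size_t max)
{
	size_t n = 0;

	for (struct job *j = pending; j && n < max; j = j->next) {
		if (j->entity == guilty)
			out[n++] = j;
	}

	return n;
}
```

In a real implementation the walk would have to be serialized against the
scheduler's own list manipulation; the sketch leaves that out on purpose.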
* [PATCH 3/3] drm/sched: Add TODO entry for missing runqueue locks
From: Philipp Stanner @ 2025-10-17 13:47 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, David Airlie, Simona Vetter, Sumit Semwal
Cc: dri-devel, linux-kernel, linux-media
struct drm_sched_rq is accessed without locking in many places
throughout the scheduler, at least by readers. This was documented in a
FIXME added in:
commit 981b04d96856 ("drm/sched: improve docs around drm_sched_entity")
Add a TODO entry for that problem.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/TODO | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
index 6a06e2858dd6..f4b5bee8e3eb 100644
--- a/drivers/gpu/drm/scheduler/TODO
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -25,3 +25,16 @@
3. Port a driver as first user.
3. Document the new alternative in the docu of deprecated
drm_sched_resubmit_jobs().
+
+* Unlocked readers for runqueues
+ - Difficulty: medium
+ - Contact: Philipp Stanner <phasta@kernel.org>
+ - Description:
+ There is an old FIXME by Sima in include/drm/gpu_scheduler.h. It details
+ that struct drm_sched_rq is read at many places without any locks, not even
+ with a READ_ONCE. At XDC 2025 no one could really tell why that is the case,
+ whether locks are needed and whether they could be added. (But for real,
+ that should probably be locked!).
+ - Tasks:
+ 1. Check whether locks for runqueue readers can be added.
+ 2. If yes, add the locks.
--
2.49.0
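The "add the locks" task above boils down to making readers take the same
lock that writers already hold. A minimal user-space sketch of that pattern
follows. It assumes a simplified runqueue with a single counter; struct
runqueue, its nr_entities field and the rq_*() helpers are hypothetical, and
the C11 atomic_flag spinlock merely stands in for a kernel lock.

```c
#include <stdatomic.h>

/* Hypothetical, simplified runqueue; not the real struct drm_sched_rq. */
struct runqueue {
	atomic_flag lock;	/* user-space stand-in for a kernel spinlock */
	int nr_entities;	/* hypothetical field protected by @lock */
};

void rq_lock(struct runqueue *rq)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		;
}

void rq_unlock(struct runqueue *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Writer: modifies the runqueue under the lock. */
void rq_add_entity(struct runqueue *rq)
{
	rq_lock(rq);
	rq->nr_entities++;
	rq_unlock(rq);
}

/* Reader: takes the same lock instead of reading the field unprotected. */
int rq_nr_entities(struct runqueue *rq)
{
	int n;

	rq_lock(rq);
	n = rq->nr_entities;
	rq_unlock(rq);

	return n;
}
```

Whether every reader in the scheduler can afford to take the lock is exactly
what task 1 has to establish; the sketch only shows the target pattern.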