public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] drm/sched: Remove out of place resubmit docu
@ 2025-10-17 13:47 Philipp Stanner
  2025-10-17 13:47 ` [PATCH 2/3] drm/sched: Add TODO file with first entry Philipp Stanner
  2025-10-17 13:47 ` [PATCH 3/3] drm/sched: Add TODO entry for missing runqueue locks Philipp Stanner
  0 siblings, 2 replies; 3+ messages in thread
From: Philipp Stanner @ 2025-10-17 13:47 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Sumit Semwal
  Cc: dri-devel, linux-kernel, linux-media

The documentation for drm_sched_backend_ops.run_job() details that that
callback can be invoked multiple times by the deprecated function
drm_sched_resubmit_jobs(). It also contains an unresolved TODO.

It is not useful to document side effects of a different, deprecated
function in the docu of run_job(): Existing users won't re-evaluate
their usage  of the deprecated function by reading the non-deprecated
one, and new users must not use the deprecated function in the first
place.

Remove the out of place documentation.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 include/drm/gpu_scheduler.h | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index fb88301b3c45..9c629bbc0684 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -429,16 +429,6 @@ struct drm_sched_backend_ops {
 	 *
 	 * @sched_job: the job to run
 	 *
-	 * The deprecated drm_sched_resubmit_jobs() (called by &struct
-	 * drm_sched_backend_ops.timedout_job) can invoke this again with the
-	 * same parameters. Using this is discouraged because it violates
-	 * dma_fence rules, notably dma_fence_init() has to be called on
-	 * already initialized fences for a second time. Moreover, this is
-	 * dangerous because attempts to allocate memory might deadlock with
-	 * memory management code waiting for the reset to complete.
-	 *
-	 * TODO: Document what drivers should do / use instead.
-	 *
 	 * This method is called in a workqueue context - either from the
 	 * submit_wq the driver passed through drm_sched_init(), or, if the
 	 * driver passed NULL, a separate, ordered workqueue the scheduler
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 2/3] drm/sched: Add TODO file with first entry
  2025-10-17 13:47 [PATCH 1/3] drm/sched: Remove out of place resubmit docu Philipp Stanner
@ 2025-10-17 13:47 ` Philipp Stanner
  2025-10-17 13:47 ` [PATCH 3/3] drm/sched: Add TODO entry for missing runqueue locks Philipp Stanner
  1 sibling, 0 replies; 3+ messages in thread
From: Philipp Stanner @ 2025-10-17 13:47 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Sumit Semwal
  Cc: dri-devel, linux-kernel, linux-media

Add a drm_sched TODO file with open tasks, contact info, difficulty
level and a job description.

Add the missing successor of drm_sched_resubmit_jobs() as a first task.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/TODO | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 drivers/gpu/drm/scheduler/TODO

diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
new file mode 100644
index 000000000000..6a06e2858dd6
--- /dev/null
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -0,0 +1,27 @@
+=== drm_sched TODO list ===
+
+* GPU job resubmits
+  - Difficulty: hard
+  - Contact:
+    - Christian König <ckoenig.leichtzumerken@gmail.com>
+    - Philipp Stanner <phasta@kernel.org>
+  - Description:
+    drm_sched_resubmit_jobs() is deprecated. Main reason being that it leads to
+    reinitializing dma_fences. See that function's docu for details. The better
+    approach for valid resubmissions by amdgpu and Xe is (apparently) to figure
+    out which job (and, through association: which entity) caused the hang. Then,
+    the job's buffer data, together with all other jobs' buffer data currently
+    in the same hardware ring, must be invalidated. This can for example be done
+    by overwriting it.
+    amdgpu currently determines which jobs are in the ring and need to be
+    overwritten by keeping copies of the job. Xe obtains that information by
+    directly accessing drm_sched's pending_list.
+  - Tasks:
+	1. implement scheduler functionality through which
+	   the driver can obtain the information which *broken* jobs are currently in
+	   the hardware ring.
+	2. Such infrastructure would then typically be used in
+	   drm_sched_backend_ops.timedout_job(). Document that.
+	3. Port a driver as first user.
+	3. Document the new alternative in the docu of deprecated
+	   drm_sched_resubmit_jobs().
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 3/3] drm/sched: Add TODO entry for missing runqueue locks
  2025-10-17 13:47 [PATCH 1/3] drm/sched: Remove out of place resubmit docu Philipp Stanner
  2025-10-17 13:47 ` [PATCH 2/3] drm/sched: Add TODO file with first entry Philipp Stanner
@ 2025-10-17 13:47 ` Philipp Stanner
  1 sibling, 0 replies; 3+ messages in thread
From: Philipp Stanner @ 2025-10-17 13:47 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Sumit Semwal
  Cc: dri-devel, linux-kernel, linux-media

struct drm_sched_rq is not being locked at many places throughout the
scheduler, at least for readers. This was documented in a FIXME added
in:

commit 981b04d96856 ("drm/sched: improve docs around drm_sched_entity")

Add a TODO entry for that problem.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/TODO | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
index 6a06e2858dd6..f4b5bee8e3eb 100644
--- a/drivers/gpu/drm/scheduler/TODO
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -25,3 +25,16 @@
 	3. Port a driver as first user.
 	3. Document the new alternative in the docu of deprecated
 	   drm_sched_resubmit_jobs().
+
+* Unlocked readers for runqueues
+  - Difficulty: medium
+  - Contact: Philipp Stanner <phasta@kernel.org>
+  - Description:
+    There is an old FIXME by Sima in include/drm/gpu_scheduler.h. It details
+    that struct drm_sched_rq is read at many places without any locks, not even
+    with a READ_ONCE. At XDC 2025 no one could really tell why that is the case,
+    whether locks are needed and whether they could be added. (But for real,
+    that should probably be locked!).
+  - Tasks:
+	1. Check whether locks for runqueue readers can be added.
+	2. If yes, add the locks.
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-10-17 13:47 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-17 13:47 [PATCH 1/3] drm/sched: Remove out of place resubmit docu Philipp Stanner
2025-10-17 13:47 ` [PATCH 2/3] drm/sched: Add TODO file with first entry Philipp Stanner
2025-10-17 13:47 ` [PATCH 3/3] drm/sched: Add TODO entry for missing runqueue locks Philipp Stanner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox