linux-media.vger.kernel.org archive mirror
* [PATCH v2 0/4] Add TODO list (+ small docu change)
@ 2025-10-23 14:30 Philipp Stanner
  2025-10-23 14:30 ` [PATCH v2 1/4] drm/sched: Remove out of place resubmit docu Philipp Stanner
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Philipp Stanner @ 2025-10-23 14:30 UTC
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel, linux-media

drm_sched has so many problems that we should have our own TODO list for
contributors who might wanna help.

Changes in v2:
  - Add generic TODO list example that can stay in the file forever.

Philipp Stanner (4):
  drm/sched: Remove out of place resubmit docu
  drm/sched: Add a TODO list
  drm/sched: Add TODO entry for resubmitting jobs
  drm/sched: Add TODO entry for missing runqueue locks

 drivers/gpu/drm/scheduler/TODO | 51 ++++++++++++++++++++++++++++++++++
 include/drm/gpu_scheduler.h    | 10 -------
 2 files changed, 51 insertions(+), 10 deletions(-)
 create mode 100644 drivers/gpu/drm/scheduler/TODO

-- 
2.49.0


* [PATCH v2 1/4] drm/sched: Remove out of place resubmit docu
From: Philipp Stanner @ 2025-10-23 14:30 UTC
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel, linux-media

The documentation for drm_sched_backend_ops.run_job() states that the
callback can be invoked multiple times by the deprecated function
drm_sched_resubmit_jobs(). It also contains an unresolved TODO.

It is not useful to document side effects of a different, deprecated
function in the docu of run_job(): existing users won't re-evaluate
their usage of the deprecated function by reading the documentation of
the non-deprecated one, and new users must not use the deprecated
function in the first place.

Remove the out-of-place documentation.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 include/drm/gpu_scheduler.h | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index fb88301b3c45..9c629bbc0684 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -429,16 +429,6 @@ struct drm_sched_backend_ops {
 	 *
 	 * @sched_job: the job to run
 	 *
-	 * The deprecated drm_sched_resubmit_jobs() (called by &struct
-	 * drm_sched_backend_ops.timedout_job) can invoke this again with the
-	 * same parameters. Using this is discouraged because it violates
-	 * dma_fence rules, notably dma_fence_init() has to be called on
-	 * already initialized fences for a second time. Moreover, this is
-	 * dangerous because attempts to allocate memory might deadlock with
-	 * memory management code waiting for the reset to complete.
-	 *
-	 * TODO: Document what drivers should do / use instead.
-	 *
 	 * This method is called in a workqueue context - either from the
 	 * submit_wq the driver passed through drm_sched_init(), or, if the
 	 * driver passed NULL, a separate, ordered workqueue the scheduler
-- 
2.49.0
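
For readers who wonder why a second dma_fence_init() on a live fence is a
problem: below is a minimal userspace sketch of one way such a
re-initialization can go wrong, assuming an init routine that
unconditionally resets a reference count. struct mock_fence and all
mock_* helpers are hypothetical and only mimic that one aspect; they are
not the kernel's dma_fence API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a fence; NOT the kernel's struct dma_fence. */
struct mock_fence {
	int refcount;	/* number of current holders */
	bool signaled;
};

/* Mimics an init routine that unconditionally resets the object. */
static void mock_fence_init(struct mock_fence *f)
{
	f->refcount = 1;
	f->signaled = false;
}

/* Takes an additional reference on an already initialized fence. */
static void mock_fence_get(struct mock_fence *f)
{
	assert(f->refcount > 0);
	f->refcount++;
}

/* Returns the refcount after a first run, one extra waiter, and a re-run. */
static int refcount_after_resubmit(void)
{
	struct mock_fence f;

	mock_fence_init(&f);	/* first run_job()-like invocation */
	mock_fence_get(&f);	/* a waiter takes a reference: refcount is 2 */
	mock_fence_init(&f);	/* the resubmission initializes the fence again */

	return f.refcount;	/* the waiter's reference is no longer counted */
}
```

In this mock, refcount_after_resubmit() reports a single holder although
two exist, which is the kind of inconsistency a re-run of the callback on
the same job can introduce.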


* [PATCH v2 2/4] drm/sched: Add a TODO list
From: Philipp Stanner @ 2025-10-23 14:30 UTC
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel, linux-media

The DRM GPU scheduler contains a huge number of problems. These should
be documented so that (new) contributors can pick up work.

Add a TODO list with an example entry.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/TODO | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 drivers/gpu/drm/scheduler/TODO

diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
new file mode 100644
index 000000000000..79044adb7d01
--- /dev/null
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -0,0 +1,12 @@
+=== drm_sched TODO list ===
+
+* Example Entry
+  - Difficulty: hard
+  - Contact:
+    - Danilo Krummrich <dakr@kernel.org>
+    - Philipp Stanner <phasta@kernel.org>
+  - Description:
+    This is an example.
+  - Tasks:
+	1. Read the example entry.
+	2. Remove the entry once solved (never in this case)
-- 
2.49.0


* [PATCH v2 3/4] drm/sched: Add TODO entry for resubmitting jobs
From: Philipp Stanner @ 2025-10-23 14:30 UTC
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel, linux-media

Add a TODO entry for the missing successor of drm_sched_resubmit_jobs().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/TODO | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
index 79044adb7d01..713dd62c58da 100644
--- a/drivers/gpu/drm/scheduler/TODO
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -10,3 +10,29 @@
   - Tasks:
 	1. Read the example entry.
 	2. Remove the entry once solved (never in this case)
+
+* GPU job resubmits
+  - Difficulty: hard
+  - Contact:
+    - Christian König <ckoenig.leichtzumerken@gmail.com>
+    - Philipp Stanner <phasta@kernel.org>
+  - Description:
+    drm_sched_resubmit_jobs() is deprecated. Main reason being that it leads to
+    reinitializing dma_fences. See that function's docu for details. The better
+    approach for valid resubmissions by amdgpu and Xe is (apparently) to figure
+    out which job (and, through association: which entity) caused the hang. Then,
+    the job's buffer data, together with all other jobs' buffer data currently
+    in the same hardware ring, must be invalidated. This can for example be done
+    by overwriting it.
+    amdgpu currently determines which jobs are in the ring and need to be
+    overwritten by keeping copies of the job. Xe obtains that information by
+    directly accessing drm_sched's pending_list.
+  - Tasks:
+	1. implement scheduler functionality through which
+	   the driver can obtain the information which *broken* jobs are currently in
+	   the hardware ring.
+	2. Such infrastructure would then typically be used in
+	   drm_sched_backend_ops.timedout_job(). Document that.
+	3. Port a driver as first user.
+	3. Document the new alternative in the docu of deprecated
+	   drm_sched_resubmit_jobs().
-- 
2.49.0
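
Task 1 of the entry above asks for a way to query which broken jobs are
currently in the hardware ring. A minimal userspace sketch of what such a
query could look like follows. Everything in it is hypothetical: the
mock_* types are not drm_sched structures, no such helper exists in the
scheduler, and "broken" is assumed here to mean "submitted by the same
entity as the job that caused the hang".

```c
#include <assert.h>
#include <stddef.h>

/* All types below are hypothetical mocks, NOT drm_sched structures. */
struct mock_job {
	int id;
	int entity_id;		/* entity that submitted the job */
	struct mock_job *next;	/* next job in the pending list */
};

struct mock_sched {
	struct mock_job *pending_head;	/* jobs handed to the hardware ring */
};

/*
 * Hypothetical query: store up to @max pending jobs that belong to the
 * same entity as @guilty in @out and return how many were stored.
 */
static size_t mock_sched_collect_broken(const struct mock_sched *sched,
					const struct mock_job *guilty,
					const struct mock_job **out,
					size_t max)
{
	size_t n = 0;

	assert(sched && guilty && out);

	for (const struct mock_job *j = sched->pending_head;
	     j && n < max; j = j->next) {
		if (j->entity_id == guilty->entity_id)
			out[n++] = j;
	}

	return n;
}

/* Builds a three-job ring and returns how many jobs the query reports. */
static size_t demo_collect(void)
{
	struct mock_job c = { .id = 3, .entity_id = 7, .next = NULL };
	struct mock_job b = { .id = 2, .entity_id = 9, .next = &c };
	struct mock_job a = { .id = 1, .entity_id = 7, .next = &b };
	struct mock_sched sched = { .pending_head = &a };
	const struct mock_job *broken[4];

	/* Job 1 is assumed to be the one that caused the hang. */
	return mock_sched_collect_broken(&sched, &a, broken, 4);
}
```

A driver's timedout_job() callback would presumably be the caller of such
a helper (task 2). The point of the sketch is only that the driver gets
the list from the scheduler instead of keeping its own copies or walking
pending_list directly.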


* [PATCH v2 4/4] drm/sched: Add TODO entry for missing runqueue locks
From: Philipp Stanner @ 2025-10-23 14:30 UTC
  To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
	Christian König, David Airlie, Simona Vetter, Tvrtko Ursulin
  Cc: dri-devel, linux-kernel, linux-media

struct drm_sched_rq is accessed without locking in many places
throughout the scheduler, at least by readers. This was documented in a
FIXME added in:

commit 981b04d96856 ("drm/sched: improve docs around drm_sched_entity")

Add a TODO entry for that problem.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/TODO | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/TODO b/drivers/gpu/drm/scheduler/TODO
index 713dd62c58da..263ce2deb69a 100644
--- a/drivers/gpu/drm/scheduler/TODO
+++ b/drivers/gpu/drm/scheduler/TODO
@@ -36,3 +36,16 @@
 	3. Port a driver as first user.
 	3. Document the new alternative in the docu of deprecated
 	   drm_sched_resubmit_jobs().
+
+* Unlocked readers for runqueues
+  - Difficulty: medium
+  - Contact: Philipp Stanner <phasta@kernel.org>
+  - Description:
+    There is an old FIXME by Sima in include/drm/gpu_scheduler.h. It details
+    that struct drm_sched_rq is read at many places without any locks, not even
+    with a READ_ONCE. At XDC 2025 no one could really tell why that is the case,
+    whether locks are needed and whether they could be added. (But for real,
+    that should probably be locked!).
+  - Tasks:
+	1. Check whether locks for runqueue readers can be added.
+	2. If yes, add the locks.
-- 
2.49.0
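
To make the difference between the reader variants mentioned in the entry
above concrete, here is a minimal userspace sketch. It assumes a mock
runqueue protected by a simple spinlock built from C11 atomics.
struct mock_rq, MOCK_READ_ONCE() and the mock_* helpers are hypothetical
approximations, not the kernel's struct drm_sched_rq, READ_ONCE() or
spinlock API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical mock of a runqueue; NOT struct drm_sched_rq. */
struct mock_rq {
	atomic_flag lock;	/* stand-in for a spinlock */
	int nr_entities;
};

/* Userspace approximation of the kernel's READ_ONCE() for an int. */
#define MOCK_READ_ONCE(x) (*(const volatile int *)&(x))

static void mock_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the holder clears the flag */
}

static void mock_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Writer: always modifies the runqueue under the lock. */
static void mock_rq_add(struct mock_rq *rq)
{
	mock_spin_lock(&rq->lock);
	rq->nr_entities++;
	mock_spin_unlock(&rq->lock);
}

/* Unlocked reader: requests one load, but is not serialized with writers. */
static int mock_rq_peek(const struct mock_rq *rq)
{
	return MOCK_READ_ONCE(rq->nr_entities);
}

/* Locked reader: serialized against writers that take the same lock. */
static int mock_rq_count(struct mock_rq *rq)
{
	int n;

	mock_spin_lock(&rq->lock);
	n = rq->nr_entities;
	mock_spin_unlock(&rq->lock);

	return n;
}

/* Single-threaded demo: both readers agree when no writer is running. */
static int demo_rq(void)
{
	struct mock_rq rq = { .lock = ATOMIC_FLAG_INIT, .nr_entities = 0 };

	mock_rq_add(&rq);
	mock_rq_add(&rq);
	assert(mock_rq_peek(&rq) == mock_rq_count(&rq));

	return mock_rq_count(&rq);
}
```

The single-threaded demo cannot show a race. It only contrasts the two
reader shapes that task 1 would have to choose between for each call site.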


* Re: [PATCH v2 0/4] Add TODO list (+ small docu change)
From: Danilo Krummrich @ 2025-10-30 11:00 UTC
  To: Philipp Stanner
  Cc: Matthew Brost, Christian König, David Airlie, Simona Vetter,
	Tvrtko Ursulin, dri-devel, linux-kernel, linux-media

On 10/23/25 4:30 PM, Philipp Stanner wrote:
> drm_sched has so many problems that we should have our own TODO list for
> contributors who might wanna help.

Looks good,

Acked-by: Danilo Krummrich <dakr@kernel.org>

>  drivers/gpu/drm/scheduler/TODO | 51 ++++++++++++++++++++++++++++++++++

I'd move this into Documentation/, just like we did for Nova [1].

[1] https://docs.kernel.org/gpu/nova/core/todo.html
