* [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG
@ 2025-07-08 13:25 Maíra Canal
2025-07-08 13:25 ` [PATCH v5 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET Maíra Canal
` (8 more replies)
0 siblings, 9 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal,
Min Ma, Lizhi Hou, Oded Gabbay, Frank Binns, Matt Coster,
Qiang Yu, Lyude Paul, Alex Deucher, Christian König
TL;DR: The only two patches that are lacking R-b's are:
[PATCH 2/8] drm/sched: Allow drivers to skip the reset and keep on running
[PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
-> If Intel CI succeeds, it's Reviewed-by: Matthew Brost <matthew.brost@intel.com>
For those two patches, it would be great to gather feedback and/or R-b's,
particularly from the Intel folks.
Thanks for all the reviews so far!
---
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:
1. The GPU exposes some mechanism that ensures the GPU is still making
progress. By checking this mechanism, the driver can safely skip the
reset, re-arm the timeout, and allow the job to continue running until
completion. This is the case for v3d, Etnaviv, and Xe.
2. The timeout fired before the free-job worker ran. Consequently, the
scheduler calls `timedout_job()` for a job that hasn't actually timed out.
These two scenarios are problematic because the job was removed from the
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, the scheduler won't free it through
`sched->ops->free_job()`. As a result, the job and its resources are never
freed, leading to a memory leak.
For v3d specifically, we have observed that these memory leaks can be
significant in certain scenarios, as reported by users in [1][2]. To
address this situation, I submitted a patch similar to commit 704d3d60fec4
("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
This patch has already landed in drm-misc-fixes and successfully resolved
the users' issues.
However, as I mentioned in [3], exposing the scheduler's internals within
the drivers isn't ideal and I believe this specific situation can be
addressed within the DRM scheduler framework.
This series aims to resolve this issue by adding a new DRM sched status
that allows a driver to skip the reset. This new status will indicate that
the job should be reinserted into the pending list, and the driver will
still signal its completion.
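For illustration, a driver's `timedout_job()` callback then boils down to
something along these lines (a minimal sketch, not part of the series;
`driver_gpu_made_progress()` and `driver_reset()` are hypothetical
placeholders for the driver-specific progress check and reset path):

	static enum drm_gpu_sched_stat
	example_timedout_job(struct drm_sched_job *sched_job)
	{
		/*
		 * Hypothetical driver-specific check: has the hardware
		 * advanced since the last timeout?
		 */
		if (driver_gpu_made_progress(sched_job)) {
			/*
			 * Not hung: the scheduler reinserts the job into the
			 * pending list and re-arms the timeout.
			 */
			return DRM_GPU_SCHED_STAT_NO_HANG;
		}

		/* Actually hung: reset the hardware (hypothetical helper). */
		driver_reset(sched_job);

		return DRM_GPU_SCHED_STAT_RESET;
	}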
[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
[2] https://github.com/raspberrypi/linux/issues/6817
[3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mcanal@igalia.com/T/
Best Regards,
- Maíra
---
v1 -> v2:
- Fix several grammar nits across the documentation and commit messages.
- Drop "drm/sched: Always free the job after the timeout" (Tvrtko Ursulin)
- [1/8] NEW PATCH: Rename DRM_GPU_SCHED_STAT_NOMINAL to a more semantic
name (Tvrtko Ursulin, Philipp Stanner)
- [2/8] Rename DRM_GPU_SCHED_STAT_RUNNING to DRM_GPU_SCHED_STAT_NO_HANG (Tvrtko Ursulin, Philipp Stanner)
- [2/8] Requeue free-job work after reinserting the job to the pending list (Matthew Brost)
- [2/8] Create a helper function to reinsert the job (Philipp Stanner)
- [2/8] Rewrite the commit message (Philipp Stanner)
- [2/8] Add a comment to `drm_sched_start()` documentation, similar to what
was commented in `drm_sched_stop()` (Philipp Stanner)
- [3/8] Keep HZ as timeout for `drm_mock_sched_job_wait_scheduled()` (Tvrtko Ursulin)
- [4/8] Use a job flag to indicate that `timedout_job()` should skip the
reset (Tvrtko Ursulin)
- [7/8] Use DRM_GPU_SCHED_STAT_NO_HANG to re-arm the timer in other cases
as well (Matthew Brost)
- Link to v1: https://lore.kernel.org/r/20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com
v2 -> v3:
- [2/8] Address comments about the commit message (Philipp Stanner)
- [2/8] Improve comments and documentation style (Philipp Stanner)
- [3/8] Rename the commit title to "drm/sched: Make timeout KUnit tests faster" (Philipp Stanner)
- [3/8] Add Tvrtko's R-b (Tvrtko Ursulin)
- [4/8] Instead of setting up a job duration, advance it manually (Tvrtko Ursulin)
- [4/8] Wait for 2 * MOCK_TIMEOUT instead of MOCK_TIMEOUT (Tvrtko Ursulin)
- [5/8, 6/8, 7/8, 8/8] Use Philipp's suggestion to improve the commit messages (Philipp Stanner)
- Link to v2: https://lore.kernel.org/r/20250530-sched-skip-reset-v2-0-c40a8d2d8daa@igalia.com
v3 -> v4:
- [1/8] s/betters/better and Philipp's R-b (Philipp Stanner)
- [2/8] Apply some documentation nits (Philipp Stanner)
- [3/8] Add Philipp's A-b (Philipp Stanner)
- [4/8, 5/8] Add Tvrtko's R-b (Tvrtko Ursulin)
- [6/8] Add Lucas' R-b (Lucas Stach)
- Link to v3: https://lore.kernel.org/r/20250618-sched-skip-reset-v3-0-8be5cca2725d@igalia.com
v4 -> v5:
- Rebased on top of drm-tip (for Intel CI)
- [2/8] Reword the commit message (Philipp Stanner)
- [2/8] Reword several comments (Philipp Stanner)
- [4/8] Add Philipp's R-b (Philipp Stanner)
- Link to v4: https://lore.kernel.org/r/20250707-sched-skip-reset-v4-0-036c0f0f584f@igalia.com
---
Maíra Canal (8):
drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
drm/sched: Allow drivers to skip the reset and keep on running
drm/sched: Make timeout KUnit tests faster
drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG
drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
drm/panfrost: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
drivers/accel/amdxdna/aie2_ctx.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/etnaviv/etnaviv_sched.c | 16 +++----
drivers/gpu/drm/imagination/pvr_queue.c | 4 +-
drivers/gpu/drm/lima/lima_sched.c | 6 +--
drivers/gpu/drm/nouveau/nouveau_exec.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +-
drivers/gpu/drm/panfrost/panfrost_job.c | 10 ++---
drivers/gpu/drm/panthor/panthor_mmu.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 2 +-
drivers/gpu/drm/scheduler/sched_main.c | 48 +++++++++++++++++++--
drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 7 ++-
drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 +
drivers/gpu/drm/scheduler/tests/tests_basic.c | 55 ++++++++++++++++++++++--
drivers/gpu/drm/v3d/v3d_sched.c | 18 ++------
drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++----
include/drm/gpu_scheduler.h | 7 ++-
17 files changed, 137 insertions(+), 61 deletions(-)
---
base-commit: 8b32b5509128873da8ecfc06beefcb58927eb50b
change-id: 20250502-sched-skip-reset-bf7c163233da
* [PATCH v5 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
` (7 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal,
Min Ma, Lizhi Hou, Oded Gabbay, Frank Binns, Matt Coster,
Qiang Yu, Lyude Paul, Alex Deucher, Christian König
Among the scheduler's statuses, the only one that indicates an error is
DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV
signifies that the operation succeeded and the GPU is in a nominal state.
However, to provide more information about the GPU's status, the return
value needs to convey more than just "OK".
Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to
DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this
status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has
hung, but it has been successfully reset and is now in a nominal state
again.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>
---
To: Min Ma <min.ma@amd.com>
To: Lizhi Hou <lizhi.hou@amd.com>
To: Oded Gabbay <ogabbay@kernel.org>
To: Frank Binns <frank.binns@imgtec.com>
To: Matt Coster <matt.coster@imgtec.com>
To: Qiang Yu <yuq825@gmail.com>
To: Lyude Paul <lyude@redhat.com>
To: Alex Deucher <alexander.deucher@amd.com>
To: Christian König <christian.koenig@amd.com>
---
drivers/accel/amdxdna/aie2_ctx.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/etnaviv/etnaviv_sched.c | 4 ++--
drivers/gpu/drm/imagination/pvr_queue.c | 4 ++--
drivers/gpu/drm/lima/lima_sched.c | 6 +++---
drivers/gpu/drm/nouveau/nouveau_exec.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +-
drivers/gpu/drm/panfrost/panfrost_job.c | 6 +++---
drivers/gpu/drm/panthor/panthor_mmu.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 2 +-
drivers/gpu/drm/scheduler/sched_main.c | 2 +-
drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 2 +-
drivers/gpu/drm/v3d/v3d_sched.c | 6 +++---
drivers/gpu/drm/xe/xe_guc_submit.c | 6 +++---
include/drm/gpu_scheduler.h | 4 ++--
15 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/drivers/accel/amdxdna/aie2_ctx.c b/drivers/accel/amdxdna/aie2_ctx.c
index f20999f2d66864fd4a6b7069e866727c37befb39..2cff5419bd2facb59ff5df6388aba0512fd45d5c 100644
--- a/drivers/accel/amdxdna/aie2_ctx.c
+++ b/drivers/accel/amdxdna/aie2_ctx.c
@@ -361,7 +361,7 @@ aie2_sched_job_timedout(struct drm_sched_job *sched_job)
aie2_hwctx_restart(xdna, hwctx);
mutex_unlock(&xdna->dev_lock);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static const struct drm_sched_backend_ops sched_ops = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 2b58e353cca154223ee5314f0285cc1f805430f6..b9ecb13930894430c206a7d80a9c870ec2635b68 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -176,7 +176,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
exit:
drm_dev_exit(idx);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 71e2e6b9d71393d5f81eadd109a50e1b83f85e5f..030e6c3233bed9edd5d9c2f49f842f604e57fc1c 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -87,13 +87,13 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
drm_sched_resubmit_jobs(&gpu->sched);
drm_sched_start(&gpu->sched, 0);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
out_no_timeout:
spin_lock(&sched->job_list_lock);
list_add(&sched_job->list, &sched->pending_list);
spin_unlock(&sched->job_list_lock);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void etnaviv_sched_free_job(struct drm_sched_job *sched_job)
diff --git a/drivers/gpu/drm/imagination/pvr_queue.c b/drivers/gpu/drm/imagination/pvr_queue.c
index 5a41ee79fed646a86344cd16e78efdb45ff02e43..fc415dd0d7a73631bd4144c9f35b9b294c625a12 100644
--- a/drivers/gpu/drm/imagination/pvr_queue.c
+++ b/drivers/gpu/drm/imagination/pvr_queue.c
@@ -803,7 +803,7 @@ static void pvr_queue_start(struct pvr_queue *queue)
* the scheduler, and re-assign parent fences in the middle.
*
* Return:
- * * DRM_GPU_SCHED_STAT_NOMINAL.
+ * * DRM_GPU_SCHED_STAT_RESET.
*/
static enum drm_gpu_sched_stat
pvr_queue_timedout_job(struct drm_sched_job *s_job)
@@ -854,7 +854,7 @@ pvr_queue_timedout_job(struct drm_sched_job *s_job)
drm_sched_start(sched, 0);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
/**
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 954f4325b859b2977a2cc608a99a6ebb642f1000..739e8c6c6d909aa4263bad8a12ec07f0c6607bb2 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -412,7 +412,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
*/
if (dma_fence_is_signaled(task->fence)) {
DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
/*
@@ -429,7 +429,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
if (dma_fence_is_signaled(task->fence)) {
DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
/*
@@ -467,7 +467,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
drm_sched_resubmit_jobs(&pipe->base);
drm_sched_start(&pipe->base, 0);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void lima_sched_free_job(struct drm_sched_job *job)
diff --git a/drivers/gpu/drm/nouveau/nouveau_exec.c b/drivers/gpu/drm/nouveau/nouveau_exec.c
index 41b7c608c9054869ddadfe17c96100266e44c254..edbbda78bac90432c4877aa39a9587cf976705c7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_exec.c
+++ b/drivers/gpu/drm/nouveau/nouveau_exec.c
@@ -189,7 +189,7 @@ nouveau_exec_job_timeout(struct nouveau_job *job)
NV_PRINTK(warn, job->cli, "job timeout, channel %d killed!\n",
chan->chid);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static const struct nouveau_job_ops nouveau_exec_job_ops = {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 460a5fb024129a0557f2b55008278e1378019d89..e60f7892f5ce9aff0c5fa1908c1a0445891927ed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -371,7 +371,7 @@ nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
{
struct drm_gpu_scheduler *sched = sched_job->sched;
struct nouveau_job *job = to_nouveau_job(sched_job);
- enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_NOMINAL;
+ enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_RESET;
drm_sched_stop(sched, sched_job);
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 5657106c2f7d0a0ca6162850767f58f3200cce13..afcffe7f8fe9e11f84e4ab7e8f5a72f7bf583690 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -755,7 +755,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
* spurious. Bail out.
*/
if (dma_fence_is_signaled(job->done_fence))
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
/*
* Panfrost IRQ handler may take a long time to process an interrupt
@@ -770,7 +770,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
if (dma_fence_is_signaled(job->done_fence)) {
dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
@@ -786,7 +786,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
atomic_set(&pfdev->reset.pending, 1);
panfrost_reset(pfdev, sched_job);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void panfrost_reset_work(struct work_struct *work)
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index b39ea6acc6a9681f2ffa7d52248b6d2c119d1f1b..d0ae462015510bd0c25aaffc059d28084e600372 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -2270,7 +2270,7 @@ static enum drm_gpu_sched_stat
panthor_vm_bind_timedout_job(struct drm_sched_job *sched_job)
{
WARN(1, "VM_BIND ops are synchronous for now, there should be no timeout!");
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static const struct drm_sched_backend_ops panthor_vm_bind_ops = {
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index a2248f692a030c1c84869b9a1948ad1cb0c0b490..8f17394cc82aad9eaf01e473cd9d3dea46fa3d61 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3241,7 +3241,7 @@ queue_timedout_job(struct drm_sched_job *sched_job)
queue_start(queue);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void queue_free_job(struct drm_sched_job *sched_job)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 648b5d2feff886bc12e81af3c35335fa4e5dd050..0f32e2cb43d6af294408968a970990f9f5c47bee 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -535,7 +535,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
{
struct drm_gpu_scheduler *sched;
struct drm_sched_job *job;
- enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
+ enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_RESET;
sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index 49d067fecd67f705f7a7b957ce04699e3cc7af76..998162202972eb5919dfff4c8784ecc22c00ec9d 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -231,7 +231,7 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
drm_sched_job_cleanup(sched_job);
/* Mock job itself is freed by the kunit framework. */
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void mock_sched_free_job(struct drm_sched_job *sched_job)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 42df9d3567e79e54bf16045f31d8785d0c765670..ebbafedef952a575a3ec2e690c1391d3385782a1 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -745,7 +745,7 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
mutex_unlock(&v3d->reset_lock);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void
@@ -777,7 +777,7 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
*timedout_ctra = ctra;
v3d_sched_skip_reset(sched_job);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
return v3d_gpu_reset_for_timeout(v3d, sched_job);
@@ -823,7 +823,7 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
job->timedout_batches = batches;
v3d_sched_skip_reset(sched_job);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
return v3d_gpu_reset_for_timeout(v3d, sched_job);
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 5f91b9a6ab7d73aa1dceede7a76919fa36a194a8..1430b58d096b03a78292e523e3ee7c5dddd7efdd 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1097,7 +1097,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
xe_sched_add_pending_job(sched, job);
xe_sched_submission_start(sched);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
/* Kill the run_job entry point */
@@ -1267,7 +1267,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
/* Start fence signaling */
xe_hw_fence_irq_start(q->fence_irq);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
sched_enable:
enable_scheduling(q);
@@ -1280,7 +1280,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
xe_sched_add_pending_job(sched, job);
xe_sched_submission_start(sched);
- return DRM_GPU_SCHED_STAT_NOMINAL;
+ return DRM_GPU_SCHED_STAT_RESET;
}
static void __guc_exec_queue_fini_async(struct work_struct *w)
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e62a7214e05217d72de5c6e5168544d47099090a..83e5c00d8dd9a83ab20547a93d6fc572de97616e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -391,12 +391,12 @@ struct drm_sched_job {
* enum drm_gpu_sched_stat - the scheduler's status
*
* @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
- * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
+ * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
* @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
*/
enum drm_gpu_sched_stat {
DRM_GPU_SCHED_STAT_NONE,
- DRM_GPU_SCHED_STAT_NOMINAL,
+ DRM_GPU_SCHED_STAT_RESET,
DRM_GPU_SCHED_STAT_ENODEV,
};
--
2.50.0
* [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-07-08 13:25 ` [PATCH v5 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-09 13:08 ` Philipp Stanner
2025-07-11 13:22 ` Christian König
2025-07-08 13:25 ` [PATCH v5 3/8] drm/sched: Make timeout KUnit tests faster Maíra Canal
` (6 subsequent siblings)
8 siblings, 2 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job just took unusually long (longer than the timeout) but is
still running, and there is, thus, no reason to reset the hardware. This
can occur in two scenarios:
1. The job is taking longer than the timeout, but the driver determined
through a GPU-specific mechanism that the hardware is still making
progress. Hence, the driver would like the scheduler to skip the
timeout and treat the job as still pending from then onward. This
happens in v3d, Etnaviv, and Xe.
2. The timeout fired before the free-job worker ran. Consequently, the
scheduler calls `sched->ops->timedout_job()` for a job that hasn't
actually timed out.
These two scenarios are problematic because the job was removed from the
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, it won't be freed by the scheduler
through `sched->ops->free_job()` - leading to a memory leak.
To solve these problems, create a new `drm_gpu_sched_stat`, called
DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
new status will indicate that the job must be reinserted into
`sched->pending_list`, and the hardware / driver will still complete that
job.
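As a rough usage sketch (not part of this patch; `job_still_runs()` and
`driver_hw_reset()` are hypothetical placeholders), a driver that brackets
its reset with drm_sched_stop()/drm_sched_start() keeps that bracket only
on the real-hang path and must not call those helpers when returning the
new status:

	static enum drm_gpu_sched_stat
	example_timedout_job(struct drm_sched_job *sched_job)
	{
		struct drm_gpu_scheduler *sched = sched_job->sched;

		/* Still running: skip the reset, no drm_sched_stop()/start(). */
		if (job_still_runs(sched_job))
			return DRM_GPU_SCHED_STAT_NO_HANG;

		/* Real hang: the usual reset-recovery sequence. */
		drm_sched_stop(sched, sched_job);
		driver_hw_reset(sched_job);	/* hypothetical reset path */
		drm_sched_start(sched, 0);

		return DRM_GPU_SCHED_STAT_RESET;
	}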
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
include/drm/gpu_scheduler.h | 3 +++
2 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
{
struct drm_sched_job *job;
- spin_lock(&sched->job_list_lock);
job = list_first_entry_or_null(&sched->pending_list,
struct drm_sched_job, list);
if (job && dma_fence_is_signaled(&job->s_fence->finished))
__drm_sched_run_free_queue(sched);
+}
+
+static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
+{
+ spin_lock(&sched->job_list_lock);
+ drm_sched_run_free_queue(sched);
spin_unlock(&sched->job_list_lock);
}
@@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
spin_unlock(&sched->job_list_lock);
}
+/**
+ * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
+ * @sched: scheduler instance
+ * @job: job to be reinserted on the pending list
+ *
+ * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
+ * hung and is making progress, the scheduler must reinsert the job back into
+ * @sched->pending_list. Otherwise, the job and its resources won't be freed
+ * through the &struct drm_sched_backend_ops.free_job callback.
+ *
+ * This function must be used in "false timeout" cases only.
+ */
+static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
+ struct drm_sched_job *job)
+{
+ spin_lock(&sched->job_list_lock);
+ list_add(&job->list, &sched->pending_list);
+
+ /* After reinserting the job, the scheduler enqueues the free-job work
+ * again if ready. Otherwise, a signaled job could be added to the
+ * pending list, but never freed.
+ */
+ drm_sched_run_free_queue(sched);
+ spin_unlock(&sched->job_list_lock);
+}
+
static void drm_sched_job_timedout(struct work_struct *work)
{
struct drm_gpu_scheduler *sched;
@@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
job->sched->ops->free_job(job);
sched->free_guilty = false;
}
+
+ if (status == DRM_GPU_SCHED_STAT_NO_HANG)
+ drm_sched_job_reinsert_on_false_timeout(sched, job);
} else {
spin_unlock(&sched->job_list_lock);
}
@@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
* This function is typically used for reset recovery (see the docu of
* drm_sched_backend_ops.timedout_job() for details). Do not call it for
* scheduler teardown, i.e., before calling drm_sched_fini().
+ *
+ * As it's only used for reset recovery, drivers must not call this function
+ * in their &struct drm_sched_backend_ops.timedout_job callback when they
+ * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
*/
void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
{
@@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
* drm_sched_backend_ops.timedout_job() for details). Do not call it for
* scheduler startup. The scheduler itself is fully operational after
* drm_sched_init() succeeded.
+ *
+ * As it's only used for reset recovery, drivers must not call this function
+ * in their &struct drm_sched_backend_ops.timedout_job callback when they
+ * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
*/
void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
{
@@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
if (job)
sched->ops->free_job(job);
- drm_sched_run_free_queue(sched);
+ drm_sched_run_free_queue_unlocked(sched);
drm_sched_run_job_queue(sched);
}
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -393,11 +393,14 @@ struct drm_sched_job {
* @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
* @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
* @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
+ * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
+ * did not hang and is still running.
*/
enum drm_gpu_sched_stat {
DRM_GPU_SCHED_STAT_NONE,
DRM_GPU_SCHED_STAT_RESET,
DRM_GPU_SCHED_STAT_ENODEV,
+ DRM_GPU_SCHED_STAT_NO_HANG,
};
/**
--
2.50.0
* [PATCH v5 3/8] drm/sched: Make timeout KUnit tests faster
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-07-08 13:25 ` [PATCH v5 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET Maíra Canal
2025-07-08 13:25 ` [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (5 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
As more KUnit tests are introduced to evaluate the basic capabilities of
the `timedout_job()` hook, the test suite will continue to increase in
duration. To reduce the overall running time of the test suite, decrease
the scheduler's timeout for the timeout tests.
Before this commit:
[15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s building, 5.229s running
After this commit:
[15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s building, 4.037s running
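For reference, numbers like the ones above can be gathered with a kunit.py
run along these lines (a sketch; it assumes a .kunitconfig for the
scheduler tests exists under drivers/gpu/drm/scheduler/tests/ - adjust the
path or config options for your tree):

	./tools/testing/kunit/kunit.py run \
		--kunitconfig=drivers/gpu/drm/scheduler/tests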
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Acked-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/tests/tests_basic.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 7230057e0594c6246f02608f07fcb1f8d738ac75..41c648782f4548e202bd8711b45d28eead9bd0b2 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -5,6 +5,8 @@
#include "sched_tests.h"
+#define MOCK_TIMEOUT (HZ / 5)
+
/*
* DRM scheduler basic tests should check the basic functional correctness of
* the scheduler, including some very light smoke testing. More targeted tests,
@@ -28,7 +30,7 @@ static void drm_sched_basic_exit(struct kunit *test)
static int drm_sched_timeout_init(struct kunit *test)
{
- test->priv = drm_mock_sched_new(test, HZ);
+ test->priv = drm_mock_sched_new(test, MOCK_TIMEOUT);
return 0;
}
@@ -227,14 +229,14 @@ static void drm_sched_basic_timeout(struct kunit *test)
done = drm_mock_sched_job_wait_scheduled(job, HZ);
KUNIT_ASSERT_TRUE(test, done);
- done = drm_mock_sched_job_wait_finished(job, HZ / 2);
+ done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT / 2);
KUNIT_ASSERT_FALSE(test, done);
KUNIT_ASSERT_EQ(test,
job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
0);
- done = drm_mock_sched_job_wait_finished(job, HZ);
+ done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
KUNIT_ASSERT_FALSE(test, done);
KUNIT_ASSERT_EQ(test,
--
2.50.0
* [PATCH v5 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (2 preceding siblings ...)
2025-07-08 13:25 ` [PATCH v5 3/8] drm/sched: Make timeout KUnit tests faster Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset Maíra Canal
` (4 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Add a test to submit a single job against a scheduler with the timeout
configured and verify that if the job is still running, the timeout
handler will skip the reset and allow the job to complete.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 5 +++
drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 +
drivers/gpu/drm/scheduler/tests/tests_basic.c | 47 ++++++++++++++++++++++++
3 files changed, 53 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index 998162202972eb5919dfff4c8784ecc22c00ec9d..b3b33f85b7ae30c8e6bba97866a74978b0a96fa7 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -231,6 +231,11 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
drm_sched_job_cleanup(sched_job);
/* Mock job itself is freed by the kunit framework. */
+ if (job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET) {
+ job->flags &= ~DRM_MOCK_SCHED_JOB_DONT_RESET;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
+ }
+
return DRM_GPU_SCHED_STAT_RESET;
}
diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
index fbba38137f0c324cf2472fe5b3a8a78ec016e829..4adf961e1930203fe94241a8a0ae5f7817874a39 100644
--- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
+++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
@@ -98,6 +98,7 @@ struct drm_mock_sched_job {
#define DRM_MOCK_SCHED_JOB_DONE 0x1
#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
+#define DRM_MOCK_SCHED_JOB_DONT_RESET 0x4
unsigned long flags;
struct list_head link;
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 41c648782f4548e202bd8711b45d28eead9bd0b2..91c0449590ed24c3da18ab7d930cca47d7c317c7 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -246,8 +246,55 @@ static void drm_sched_basic_timeout(struct kunit *test)
drm_mock_sched_entity_free(entity);
}
+static void drm_sched_skip_reset(struct kunit *test)
+{
+ struct drm_mock_scheduler *sched = test->priv;
+ struct drm_mock_sched_entity *entity;
+ struct drm_mock_sched_job *job;
+ unsigned int i;
+ bool done;
+
+ /*
+ * Submit a single job against a scheduler with the timeout configured
+ * and verify that if the job is still running, the timeout handler
+ * will skip the reset and allow the job to complete.
+ */
+
+ entity = drm_mock_sched_entity_new(test,
+ DRM_SCHED_PRIORITY_NORMAL,
+ sched);
+ job = drm_mock_sched_job_new(test, entity);
+
+ job->flags = DRM_MOCK_SCHED_JOB_DONT_RESET;
+
+ drm_mock_sched_job_submit(job);
+
+ done = drm_mock_sched_job_wait_scheduled(job, HZ);
+ KUNIT_ASSERT_TRUE(test, done);
+
+ done = drm_mock_sched_job_wait_finished(job, 2 * MOCK_TIMEOUT);
+ KUNIT_ASSERT_FALSE(test, done);
+
+ KUNIT_ASSERT_EQ(test,
+ job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
+ DRM_MOCK_SCHED_JOB_TIMEDOUT);
+
+ KUNIT_ASSERT_EQ(test,
+ job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET,
+ 0);
+
+ i = drm_mock_sched_advance(sched, 1);
+ KUNIT_ASSERT_EQ(test, i, 1);
+
+ done = drm_mock_sched_job_wait_finished(job, HZ);
+ KUNIT_ASSERT_TRUE(test, done);
+
+ drm_mock_sched_entity_free(entity);
+}
+
static struct kunit_case drm_sched_timeout_tests[] = {
KUNIT_CASE(drm_sched_basic_timeout),
+ KUNIT_CASE(drm_sched_skip_reset),
{}
};
--
2.50.0
* [PATCH v5 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (3 preceding siblings ...)
2025-07-08 13:25 ` [PATCH v5 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 6/8] drm/etnaviv: " Maíra Canal
` (3 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
When a CL/CSD job times out, we check whether the GPU has made any progress
since the last timeout. If it has, we skip the reset and allow the timeout
timer to be re-armed, which gives long-running jobs a chance to complete.
Instead of manipulating the scheduler's internals, inform the scheduler,
through the new status code DRM_GPU_SCHED_STAT_NO_HANG, that the job did
not actually time out and no reset was performed.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
drivers/gpu/drm/v3d/v3d_sched.c | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index ebbafedef952a575a3ec2e690c1391d3385782a1..cb9df8822472a4602a5cf7a029ee2ca0a9abc28c 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -748,16 +748,6 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
return DRM_GPU_SCHED_STAT_RESET;
}
-static void
-v3d_sched_skip_reset(struct drm_sched_job *sched_job)
-{
- struct drm_gpu_scheduler *sched = sched_job->sched;
-
- spin_lock(&sched->job_list_lock);
- list_add(&sched_job->list, &sched->pending_list);
- spin_unlock(&sched->job_list_lock);
-}
-
static enum drm_gpu_sched_stat
v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
u32 *timedout_ctca, u32 *timedout_ctra)
@@ -776,8 +766,7 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
*timedout_ctca = ctca;
*timedout_ctra = ctra;
- v3d_sched_skip_reset(sched_job);
- return DRM_GPU_SCHED_STAT_RESET;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
}
return v3d_gpu_reset_for_timeout(v3d, sched_job);
@@ -822,8 +811,7 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
if (job->timedout_batches != batches) {
job->timedout_batches = batches;
- v3d_sched_skip_reset(sched_job);
- return DRM_GPU_SCHED_STAT_RESET;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
}
return v3d_gpu_reset_for_timeout(v3d, sched_job);
--
2.50.0
* [PATCH v5 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (4 preceding siblings ...)
2025-07-08 13:25 ` [PATCH v5 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 7/8] drm/xe: " Maíra Canal
` (2 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Etnaviv can skip a hardware reset in two situations:
1. The TDR fired before the free-job worker ran, so the timeout is spurious.
2. The GPU is still making progress on the front-end, and we can give
the job a chance to complete.
Instead of manipulating the scheduler's internals, inform the scheduler,
through the new status code DRM_GPU_SCHED_STAT_NO_HANG, that the job did
not actually time out and no reset was performed.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
---
drivers/gpu/drm/etnaviv/etnaviv_sched.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 030e6c3233bed9edd5d9c2f49f842f604e57fc1c..46f5391e84a12232b247886cf1311f8e09f42f04 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -35,17 +35,16 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
*sched_job)
{
struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
- struct drm_gpu_scheduler *sched = sched_job->sched;
struct etnaviv_gpu *gpu = submit->gpu;
u32 dma_addr, primid = 0;
int change;
/*
- * If the GPU managed to complete this jobs fence, the timout is
- * spurious. Bail out.
+ * If the GPU managed to complete this jobs fence, the timeout has
+ * fired before free-job worker. The timeout is spurious, so bail out.
*/
if (dma_fence_is_signaled(submit->out_fence))
- goto out_no_timeout;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
/*
* If the GPU is still making forward progress on the front-end (which
@@ -71,7 +70,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
gpu->hangcheck_dma_addr = dma_addr;
gpu->hangcheck_primid = primid;
gpu->hangcheck_fence = gpu->completed_fence;
- goto out_no_timeout;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
}
/* block scheduler */
@@ -87,12 +86,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
drm_sched_resubmit_jobs(&gpu->sched);
drm_sched_start(&gpu->sched, 0);
- return DRM_GPU_SCHED_STAT_RESET;
-out_no_timeout:
- spin_lock(&sched->job_list_lock);
- list_add(&sched_job->list, &sched->pending_list);
- spin_unlock(&sched->job_list_lock);
return DRM_GPU_SCHED_STAT_RESET;
}
--
2.50.0
* [PATCH v5 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (5 preceding siblings ...)
2025-07-08 13:25 ` [PATCH v5 6/8] drm/etnaviv: " Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-08 18:35 ` Matthew Brost
2025-07-08 13:25 ` [PATCH v5 8/8] drm/panfrost: " Maíra Canal
2025-07-09 13:14 ` [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Philipp Stanner
8 siblings, 1 reply; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Xe can skip the reset if the TDR fired before the free-job worker ran, and
it can also re-arm the timeout timer in some scenarios. Instead of
manipulating the scheduler's internals, inform the scheduler, through the
new status code DRM_GPU_SCHED_STAT_NO_HANG, that the job did not actually
time out and no reset was performed.
Note that, in the first case, there is no need to restart submission if it
hasn't been stopped.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
drivers/gpu/drm/xe/xe_guc_submit.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 1430b58d096b03a78292e523e3ee7c5dddd7efdd..cafb47711e9b3fab3b4b4197965835197caabe9b 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1093,12 +1093,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
* list so job can be freed and kick scheduler ensuring free job is not
* lost.
*/
- if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
- xe_sched_add_pending_job(sched, job);
- xe_sched_submission_start(sched);
-
- return DRM_GPU_SCHED_STAT_RESET;
- }
+ if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
+ return DRM_GPU_SCHED_STAT_NO_HANG;
/* Kill the run_job entry point */
xe_sched_submission_stop(sched);
@@ -1277,10 +1273,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
* but there is not currently an easy way to do in DRM scheduler. With
* some thought, do this in a follow up.
*/
- xe_sched_add_pending_job(sched, job);
xe_sched_submission_start(sched);
-
- return DRM_GPU_SCHED_STAT_RESET;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
}
static void __guc_exec_queue_fini_async(struct work_struct *w)
--
2.50.0
* [PATCH v5 8/8] drm/panfrost: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (6 preceding siblings ...)
2025-07-08 13:25 ` [PATCH v5 7/8] drm/xe: " Maíra Canal
@ 2025-07-08 13:25 ` Maíra Canal
2025-07-09 13:14 ` [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Philipp Stanner
8 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-08 13:25 UTC (permalink / raw)
To: Matthew Brost, Danilo Krummrich, Philipp Stanner,
Christian König, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Maíra Canal
Panfrost can skip the reset if the TDR fired before the free-job worker ran.
Currently, Panfrost takes no action in this scenario, so the job is leaked,
as `free_job()` will never be called.
To avoid such leaks, inform the scheduler, through the new status code
DRM_GPU_SCHED_STAT_NO_HANG, that the job did not actually time out and no
reset was performed.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
drivers/gpu/drm/panfrost/panfrost_job.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index afcffe7f8fe9e11f84e4ab7e8f5a72f7bf583690..842e012cdc68e130a13e08ffae3b7fdf5f8e1acc 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -751,11 +751,11 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
int js = panfrost_job_get_slot(job);
/*
- * If the GPU managed to complete this jobs fence, the timeout is
- * spurious. Bail out.
+ * If the GPU managed to complete this jobs fence, the timeout has
+ * fired before free-job worker. The timeout is spurious, so bail out.
*/
if (dma_fence_is_signaled(job->done_fence))
- return DRM_GPU_SCHED_STAT_RESET;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
/*
* Panfrost IRQ handler may take a long time to process an interrupt
@@ -770,7 +770,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
if (dma_fence_is_signaled(job->done_fence)) {
dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
- return DRM_GPU_SCHED_STAT_RESET;
+ return DRM_GPU_SCHED_STAT_NO_HANG;
}
dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
--
2.50.0
* Re: [PATCH v5 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
2025-07-08 13:25 ` [PATCH v5 7/8] drm/xe: " Maíra Canal
@ 2025-07-08 18:35 ` Matthew Brost
0 siblings, 0 replies; 23+ messages in thread
From: Matthew Brost @ 2025-07-08 18:35 UTC (permalink / raw)
To: Maíra Canal
Cc: Danilo Krummrich, Philipp Stanner, Christian König,
Tvrtko Ursulin, Simona Vetter, David Airlie, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau, kernel-dev, dri-devel, etnaviv,
intel-xe
On Tue, Jul 08, 2025 at 10:25:47AM -0300, Maíra Canal wrote:
> Xe can skip the reset if TDR has fired before the free job worker and can
> also re-arm the timeout timer in some scenarios. Instead of manipulating
> scheduler's internals, inform the scheduler that the job did not actually
> timeout and no reset was performed through the new status code
> DRM_GPU_SCHED_STAT_NO_HANG.
>
> Note that, in the first case, there is no need to restart submission if it
> hasn't been stopped.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 12 +++---------
> 1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 1430b58d096b03a78292e523e3ee7c5dddd7efdd..cafb47711e9b3fab3b4b4197965835197caabe9b 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1093,12 +1093,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> * list so job can be freed and kick scheduler ensuring free job is not
> * lost.
> */
> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> - xe_sched_add_pending_job(sched, job);
> - xe_sched_submission_start(sched);
> -
> - return DRM_GPU_SCHED_STAT_RESET;
> - }
> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> + return DRM_GPU_SCHED_STAT_NO_HANG;
>
> /* Kill the run_job entry point */
> xe_sched_submission_stop(sched);
> @@ -1277,10 +1273,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> * but there is not currently an easy way to do in DRM scheduler. With
> * some thought, do this in a follow up.
> */
> - xe_sched_add_pending_job(sched, job);
> xe_sched_submission_start(sched);
> -
> - return DRM_GPU_SCHED_STAT_RESET;
> + return DRM_GPU_SCHED_STAT_NO_HANG;
> }
>
> static void __guc_exec_queue_fini_async(struct work_struct *w)
>
> --
> 2.50.0
>
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-08 13:25 ` [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
@ 2025-07-09 13:08 ` Philipp Stanner
2025-07-11 13:22 ` Christian König
1 sibling, 0 replies; 23+ messages in thread
From: Philipp Stanner @ 2025-07-09 13:08 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Tvrtko Ursulin,
Simona Vetter, David Airlie, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On Tue, 2025-07-08 at 10:25 -0300, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't
> hung;
> instead, a job just took unusually long (longer than the timeout) but
> is
> still running, and there is, thus, no reason to reset the hardware.
> This
> can occur in two scenarios:
>
> 1. The job is taking longer than the timeout, but the driver
> determined
> through a GPU-specific mechanism that the hardware is still
> making
> progress. Hence, the driver would like the scheduler to skip the
> timeout and treat the job as still pending from then onward.
> This
> happens in v3d, Etnaviv, and Xe.
> 2. Timeout has fired before the free-job worker. Consequently, the
> scheduler calls `sched->ops->timedout_job()` for a job that
> isn't
> timed out.
>
> These two scenarios are problematic because the job was removed from
> the
> `sched->pending_list` before calling `sched->ops->timedout_job()`,
> which
> means that when the job finishes, it won't be freed by the scheduler
> though `sched->ops->free_job()` - leading to a memory leak.
>
> To solve these problems, create a new `drm_gpu_sched_stat`, called
> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset.
> The
> new status will indicate that the job must be reinserted into
> `sched->pending_list`, and the hardware / driver will still complete
> that
> job.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 46
> ++++++++++++++++++++++++++++++++--
> include/drm/gpu_scheduler.h | 3 +++
> 2 files changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index
> 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3
> d025c1e6bfc9f 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct
> drm_gpu_scheduler *sched)
> {
> struct drm_sched_job *job;
>
> - spin_lock(&sched->job_list_lock);
> job = list_first_entry_or_null(&sched->pending_list,
> struct drm_sched_job, list);
> if (job && dma_fence_is_signaled(&job->s_fence->finished))
> __drm_sched_run_free_queue(sched);
> +}
> +
> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
> +{
> + spin_lock(&sched->job_list_lock);
> + drm_sched_run_free_queue(sched);
> spin_unlock(&sched->job_list_lock);
> }
>
> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> spin_unlock(&sched->job_list_lock);
> }
>
> +/**
> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
> + * @sched: scheduler instance
> + * @job: job to be reinserted on the pending list
> + *
> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
> + * hung and is making progress, the scheduler must reinsert the job back into
> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
> + * through the &struct drm_sched_backend_ops.free_job callback.
> + *
> + * This function must be used in "false timeout" cases only.
> + */
> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
> + struct drm_sched_job *job)
> +{
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> +
> + /* After reinserting the job, the scheduler enqueues the free-job work
> + * again if ready. Otherwise, a signaled job could be added to the
> + * pending list, but never freed.
> + */
> + drm_sched_run_free_queue(sched);
> + spin_unlock(&sched->job_list_lock);
> +}
> +
> static void drm_sched_job_timedout(struct work_struct *work)
> {
> struct drm_gpu_scheduler *sched;
> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
> + drm_sched_job_reinsert_on_false_timeout(sched, job);
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> * This function is typically used for reset recovery (see the docu of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's only used for reset recovery, drivers must not call this function
> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> {
> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler startup. The scheduler itself is fully operational after
> * drm_sched_init() succeeded.
> + *
> + * As it's only used for reset recovery, drivers must not call this function
> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> */
> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> {
> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
> if (job)
> sched->ops->free_job(job);
>
> - drm_sched_run_free_queue(sched);
> + drm_sched_run_free_queue_unlocked(sched);
> drm_sched_run_job_queue(sched);
> }
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -393,11 +393,14 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
> + * did not hang and is still running.
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_RESET,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_NO_HANG,
> };
>
> /**
>
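For illustration, here is a minimal sketch of how a driver-side timedout_job() callback could use the new status. The my_* types and helpers below are made-up names for the example; only struct drm_sched_job, the enum values, and the reinsert/re-arm behaviour come from the series itself.

  #include <drm/gpu_scheduler.h>

  /* Hypothetical driver job: wraps the scheduler job and remembers the last
   * progress value the timeout handler observed.
   */
  struct my_job {
          struct drm_sched_job base;
          struct my_gpu *gpu;
          u32 last_progress;
  };

  static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job)
  {
          struct my_job *job = container_of(sched_job, struct my_job, base);
          u32 progress = my_gpu_progress_counter(job->gpu); /* hypothetical HW read */

          if (progress != job->last_progress) {
                  /* Slow, but not hung: skip the reset. The scheduler then
                   * reinserts the job into the pending list and re-arms the
                   * timeout, so the job is still freed via free_job() later.
                   */
                  job->last_progress = progress;
                  return DRM_GPU_SCHED_STAT_NO_HANG;
          }

          my_gpu_reset(job->gpu); /* hypothetical recovery path */
          return DRM_GPU_SCHED_STAT_RESET;
  }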
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
` (7 preceding siblings ...)
2025-07-08 13:25 ` [PATCH v5 8/8] drm/panfrost: " Maíra Canal
@ 2025-07-09 13:14 ` Philipp Stanner
2025-07-10 11:27 ` Maíra Canal
8 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-07-09 13:14 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Tvrtko Ursulin,
Simona Vetter, David Airlie, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Min Ma, Lizhi Hou,
Oded Gabbay, Frank Binns, Matt Coster, Qiang Yu, Lyude Paul,
Alex Deucher, Christian König
On Tue, 2025-07-08 at 10:25 -0300, Maíra Canal wrote:
> TL;DR: The only two patches that are lacking R-b's are:
>
> [PATCH 2/8] drm/sched: Allow drivers to skip the reset and keep on running
> [PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
> -> If Intel CI succeeds, it's Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>
> For those two patches, it would be great to gather feedback and/or R-b's,
> particularly from the Intel folks.
>
> Thanks for all the reviews so far!
>
> ---
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job may still be running, and there may be no valid reason to
> reset the hardware. This can occur in two situations:
>
> 1. The GPU exposes some mechanism that ensures the GPU is still making
> progress. By checking this mechanism, the driver can safely skip the
> reset, re-arm the timeout, and allow the job to continue running until
> completion. This is the case for v3d, Etnaviv, and Xe.
>
> 2. Timeout has fired before the free-job worker. Consequently, the
> scheduler calls `timedout_job()` for a job that isn't timed out.
>
> These two scenarios are problematic because the job was removed from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> means that when the job finishes, it won't be freed by the scheduler
> through `sched->ops->free_job()`. As a result, the job and its resources
> won't be freed, leading to a memory leak.
>
> For v3d specifically, we have observed that these memory leaks can be
> significant in certain scenarios, as reported by users in [1][2]. To
> address this situation, I submitted a patch similar to commit 704d3d60fec4
> ("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
> This patch has already landed in drm-misc-fixes and successfully resolved
> the users' issues.
>
> However, as I mentioned in [3], exposing the scheduler's internals within
> the drivers isn't ideal and I believe this specific situation can be
> addressed within the DRM scheduler framework.
>
> This series aims to resolve this issue by adding a new DRM sched status
> that allows a driver to skip the reset. This new status will indicate that
> the job should be reinserted into the pending list, and the driver will
> still signal its completion.
>
> [1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
> [2] https://github.com/raspberrypi/linux/issues/6817
> [3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mcanal@igalia.com/T/
>
> Best Regards,
> - Maíra
>
> ---
> v1 -> v2:
>
> - Fix several grammar nits across the documentation and commit messages.
> - Drop "drm/sched: Always free the job after the timeout" (Tvrtko Ursulin)
> - [1/8] NEW PATCH: Rename DRM_GPU_SCHED_STAT_NOMINAL to a more semantic
> name (Tvrtko Ursulin, Philipp Stanner)
> - [2/8] Rename DRM_GPU_SCHED_STAT_RUNNING to DRM_GPU_SCHED_STAT_NO_HANG (Tvrtko Ursulin, Philipp Stanner)
> - [2/8] Requeue free-job work after reinserting the job to the pending list (Matthew Brost)
> - [2/8] Create a helper function to reinsert the job (Philipp Stanner)
> - [2/8] Rewrite the commit message (Philipp Stanner)
> - [2/8] Add a comment to `drm_sched_start()` documentation, similar to what
> was commented in `drm_sched_stop()` (Philipp Stanner)
> - [3/8] Keep HZ as timeout for `drm_mock_sched_job_wait_scheduled()` (Tvrtko Ursulin)
> - [4/8] Use a job flag to indicate that `timedout_job()` should skip the
> reset (Tvrtko Ursulin)
> - [7/8] Use DRM_GPU_SCHED_STAT_NO_HANG to re-arm the timer in other cases
> as well (Matthew Brost)
> - Link to v1: https://lore.kernel.org/r/20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com
>
> v2 -> v3:
>
> - [2/8] Address comments about the commit message (Philipp Stanner)
> - [2/8] Improve comments and documentation style (Philipp Stanner)
> - [3/8] Rename the commit title to "drm/sched: Make timeout KUnit tests faster" (Philipp Stanner)
> - [3/8] Add Tvrtko's R-b (Tvrtko Ursulin)
> - [4/8] Instead of setting up a job duration, advance it manually (Tvrtko Ursulin)
> - [4/8] Wait for 2 * MOCK_TIMEOUT instead of MOCK_TIMEOUT (Tvrtko Ursulin)
> - [5/8, 6/8, 7/8, 8/8] Use Philipp's suggestion to improve the commit messages (Philipp Stanner)
> - Link to v2: https://lore.kernel.org/r/20250530-sched-skip-reset-v2-0-c40a8d2d8daa@igalia.com
>
> v3 -> v4:
>
> - [1/8] s/betters/better and Philipp's R-b (Philipp Stanner)
> - [2/8] Apply some documentation nits (Philipp Stanner)
> - [3/8] Add Philipp's A-b (Philipp Stanner)
> - [4/8, 5/8] Add Tvrtko's R-b (Tvrtko Ursulin)
> - [6/8] Add Lucas' R-b (Lucas Stach)
> - Link to v3: https://lore.kernel.org/r/20250618-sched-skip-reset-v3-0-8be5cca2725d@igalia.com
>
> v4 -> v5:
>
> - Rebased on top of drm-tip (for Intel CI)
> - [2/8] Reword the commit message (Philipp Stanner)
> - [2/8] Reword several comments (Philipp Stanner)
> - [4/8] Add Philipp's R-b (Philipp Stanner)
> - Link to v4: https://lore.kernel.org/r/20250707-sched-skip-reset-v4-0-036c0f0f584f@igalia.com
>
> ---
> Maíra Canal (8):
> drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
> drm/sched: Allow drivers to skip the reset and keep on running
> drm/sched: Make timeout KUnit tests faster
> drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG
> drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
> drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
> drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
> drm/panfrost: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>
> drivers/accel/amdxdna/aie2_ctx.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 16 +++----
> drivers/gpu/drm/imagination/pvr_queue.c | 4 +-
> drivers/gpu/drm/lima/lima_sched.c | 6 +--
> drivers/gpu/drm/nouveau/nouveau_exec.c | 2 +-
> drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +-
> drivers/gpu/drm/panfrost/panfrost_job.c | 10 ++---
> drivers/gpu/drm/panthor/panthor_mmu.c | 2 +-
> drivers/gpu/drm/panthor/panthor_sched.c | 2 +-
> drivers/gpu/drm/scheduler/sched_main.c | 48 +++++++++++++++++++--
> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 7 ++-
> drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 +
> drivers/gpu/drm/scheduler/tests/tests_basic.c | 55 ++++++++++++++++++++++--
> drivers/gpu/drm/v3d/v3d_sched.c | 18 ++------
> drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++----
> include/drm/gpu_scheduler.h | 7 ++-
> 17 files changed, 137 insertions(+), 61 deletions(-)
Does not apply to drm-misc-next:
Applying: drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
error: patch failed: drivers/gpu/drm/etnaviv/etnaviv_sched.c:87
error: drivers/gpu/drm/etnaviv/etnaviv_sched.c: patch does not apply
Patch failed at 0001 drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
Awkward. That file was last touched months ago. On what branch is
your series based?
Can you rebase?
From my POV you could also apply it yourself. Looks all good.
P.
> ---
> base-commit: 8b32b5509128873da8ecfc06beefcb58927eb50b
> change-id: 20250502-sched-skip-reset-bf7c163233da
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG
2025-07-09 13:14 ` [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Philipp Stanner
@ 2025-07-10 11:27 ` Maíra Canal
0 siblings, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-10 11:27 UTC (permalink / raw)
To: phasta, Matthew Brost, Danilo Krummrich, Christian König,
Tvrtko Ursulin, Simona Vetter, David Airlie, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau, Thomas Zimmermann, Maxime Ripard
Cc: kernel-dev, dri-devel, etnaviv, intel-xe, Min Ma, Lizhi Hou,
Oded Gabbay, Frank Binns, Matt Coster, Qiang Yu, Lyude Paul,
Alex Deucher, Christian König
+cc Maxime, Thomas
Hi Philipp,
On 09/07/25 10:14, Philipp Stanner wrote:
> On Tue, 2025-07-08 at 10:25 -0300, Maíra Canal wrote:
>> TL;DR: The only two patches that are lacking R-b's are:
>>
>> [PATCH 2/8] drm/sched: Allow drivers to skip the reset and keep on running
>> [PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>> -> If Intel CI succeeds, it's Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>>
>> For those two patches, it would be great to gather feedback and/or R-b's,
>> particularly from the Intel folks.
>>
>> Thanks for all the reviews so far!
>>
>> ---
[...]
>> drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
>> drm/sched: Allow drivers to skip the reset and keep on running
>> drm/sched: Make timeout KUnit tests faster
>> drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG
>> drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>> drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>> drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>> drm/panfrost: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>>
>> drivers/accel/amdxdna/aie2_ctx.c | 2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 16 +++----
>> drivers/gpu/drm/imagination/pvr_queue.c | 4 +-
>> drivers/gpu/drm/lima/lima_sched.c | 6 +--
>> drivers/gpu/drm/nouveau/nouveau_exec.c | 2 +-
>> drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +-
>> drivers/gpu/drm/panfrost/panfrost_job.c | 10 ++---
>> drivers/gpu/drm/panthor/panthor_mmu.c | 2 +-
>> drivers/gpu/drm/panthor/panthor_sched.c | 2 +-
>> drivers/gpu/drm/scheduler/sched_main.c | 48 +++++++++++++++++++--
>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 7 ++-
>> drivers/gpu/drm/scheduler/tests/sched_tests.h | 1 +
>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 55 ++++++++++++++++++++++--
>> drivers/gpu/drm/v3d/v3d_sched.c | 18 ++------
>> drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++----
>> include/drm/gpu_scheduler.h | 7 ++-
>> 17 files changed, 137 insertions(+), 61 deletions(-)
>
> Does not apply to drm-misc-next:
>
> Applying: drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
> error: patch failed: drivers/gpu/drm/etnaviv/etnaviv_sched.c:87
> error: drivers/gpu/drm/etnaviv/etnaviv_sched.c: patch does not apply
> Patch failed at 0001 drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
>
> Awkward. That file has last been touched months ago. On what branch is
> your series based?
It's based on drm-tip; otherwise, Intel CI wouldn't be able to apply the
series. drm-tip has that Etnaviv fix (commit 61ee19dedb8d "drm/etnaviv:
Protect the scheduler's pending list with its lock"), which drm-misc-next
doesn't have yet.
>
> Can you rebase?
>
> From my POV you could also apply it yourself. Looks all good.
I believe I can apply the series to drm-misc-next and solve the
conflicts in drm-tip with `dim rebuild-tip`.
@drm-misc maintainers, would it be an issue if I apply this series to
drm-misc-next and solve the conflicts?
Best Regards,
- Maíra
>
> P.
>
>
>
>> ---
>> base-commit: 8b32b5509128873da8ecfc06beefcb58927eb50b
>> change-id: 20250502-sched-skip-reset-bf7c163233da
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-08 13:25 ` [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-07-09 13:08 ` Philipp Stanner
@ 2025-07-11 13:22 ` Christian König
2025-07-11 13:37 ` Philipp Stanner
2025-07-11 14:35 ` Maíra Canal
1 sibling, 2 replies; 23+ messages in thread
From: Christian König @ 2025-07-11 13:22 UTC (permalink / raw)
To: Maíra Canal, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 08.07.25 15:25, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job just took unusually long (longer than the timeout) but is
> still running, and there is, thus, no reason to reset the hardware. This
> can occur in two scenarios:
>
> 1. The job is taking longer than the timeout, but the driver determined
> through a GPU-specific mechanism that the hardware is still making
> progress. Hence, the driver would like the scheduler to skip the
> timeout and treat the job as still pending from then onward. This
> happens in v3d, Etnaviv, and Xe.
> 2. Timeout has fired before the free-job worker. Consequently, the
> scheduler calls `sched->ops->timedout_job()` for a job that isn't
> timed out.
>
> These two scenarios are problematic because the job was removed from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> means that when the job finishes, it won't be freed by the scheduler
> through `sched->ops->free_job()` - leading to a memory leak.
Yeah, that is unfortunately intentional.
> To solve these problems, create a new `drm_gpu_sched_stat`, called
> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
> new status will indicate that the job must be reinserted into
> `sched->pending_list`, and the hardware / driver will still complete that
> job.
Well, long story short, we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
This HW fence is then given as a parameter to the driver when we run into a timeout.
This has the clear advantage that dma_fence objects have a well-defined lifetime and necessary state transitions. E.g. you can check at all times if the fence is signaled or not.
Regards,
Christian.
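To make the shape of that suggestion concrete, here is a purely hypothetical sketch of a fence-based timeout hook, assuming the scheduler tracked HW fences instead of jobs. Neither timedout_fence() nor my_fence_timeout_handler() exist in the current API; only dma_fence_is_signaled() is real.

  #include <linux/dma-fence.h>
  #include <drm/gpu_scheduler.h>

  /* Hypothetical: a timeout callback keyed on the HW fence rather than on a
   * drm_sched_job.
   */
  struct my_fence_based_ops {
          enum drm_gpu_sched_stat (*timedout_fence)(struct drm_gpu_scheduler *sched,
                                                    struct dma_fence *hw_fence);
  };

  static void my_fence_timeout_handler(struct drm_gpu_scheduler *sched,
                                       struct dma_fence *hw_fence,
                                       const struct my_fence_based_ops *ops)
  {
          /* A dma_fence has a well-defined lifetime, so its signaled state can
           * be checked at any time without any job-lifetime questions.
           */
          if (dma_fence_is_signaled(hw_fence))
                  return;

          ops->timedout_fence(sched, hw_fence);
  }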
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
> include/drm/gpu_scheduler.h | 3 +++
> 2 files changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> {
> struct drm_sched_job *job;
>
> - spin_lock(&sched->job_list_lock);
> job = list_first_entry_or_null(&sched->pending_list,
> struct drm_sched_job, list);
> if (job && dma_fence_is_signaled(&job->s_fence->finished))
> __drm_sched_run_free_queue(sched);
> +}
> +
> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
> +{
> + spin_lock(&sched->job_list_lock);
> + drm_sched_run_free_queue(sched);
> spin_unlock(&sched->job_list_lock);
> }
>
> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> spin_unlock(&sched->job_list_lock);
> }
>
> +/**
> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
> + * @sched: scheduler instance
> + * @job: job to be reinserted on the pending list
> + *
> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
> + * hung and is making progress, the scheduler must reinsert the job back into
> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
> + * through the &struct drm_sched_backend_ops.free_job callback.
> + *
> + * This function must be used in "false timeout" cases only.
> + */
> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
> + struct drm_sched_job *job)
> +{
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> +
> + /* After reinserting the job, the scheduler enqueues the free-job work
> + * again if ready. Otherwise, a signaled job could be added to the
> + * pending list, but never freed.
> + */
> + drm_sched_run_free_queue(sched);
> + spin_unlock(&sched->job_list_lock);
> +}
> +
> static void drm_sched_job_timedout(struct work_struct *work)
> {
> struct drm_gpu_scheduler *sched;
> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
> + drm_sched_job_reinsert_on_false_timeout(sched, job);
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> * This function is typically used for reset recovery (see the docu of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's only used for reset recovery, drivers must not call this function
> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> {
> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler startup. The scheduler itself is fully operational after
> * drm_sched_init() succeeded.
> + *
> + * As it's only used for reset recovery, drivers must not call this function
> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> */
> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> {
> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
> if (job)
> sched->ops->free_job(job);
>
> - drm_sched_run_free_queue(sched);
> + drm_sched_run_free_queue_unlocked(sched);
> drm_sched_run_job_queue(sched);
> }
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -393,11 +393,14 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
> + * did not hang and is still running.
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_RESET,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_NO_HANG,
> };
>
> /**
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-11 13:22 ` Christian König
@ 2025-07-11 13:37 ` Philipp Stanner
2025-07-11 15:20 ` Christian König
2025-07-11 14:35 ` Maíra Canal
1 sibling, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-07-11 13:37 UTC (permalink / raw)
To: Christian König, Maíra Canal, Matthew Brost,
Danilo Krummrich, Philipp Stanner, Tvrtko Ursulin, Simona Vetter,
David Airlie, Melissa Wen, Lucas Stach, Russell King,
Christian Gmeiner, Lucas De Marchi, Thomas Hellström,
Rodrigo Vivi, Boris Brezillon, Rob Herring, Steven Price,
Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>
>
> On 08.07.25 15:25, Maíra Canal wrote:
> > When the DRM scheduler times out, it's possible that the GPU isn't hung;
> > instead, a job just took unusually long (longer than the timeout) but is
> > still running, and there is, thus, no reason to reset the hardware. This
> > can occur in two scenarios:
> >
> > 1. The job is taking longer than the timeout, but the driver determined
> > through a GPU-specific mechanism that the hardware is still making
> > progress. Hence, the driver would like the scheduler to skip the
> > timeout and treat the job as still pending from then onward. This
> > happens in v3d, Etnaviv, and Xe.
> > 2. Timeout has fired before the free-job worker. Consequently, the
> > scheduler calls `sched->ops->timedout_job()` for a job that isn't
> > timed out.
> >
> > These two scenarios are problematic because the job was removed from the
> > `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> > means that when the job finishes, it won't be freed by the scheduler
> > through `sched->ops->free_job()` - leading to a memory leak.
>
> Yeah, that is unfortunately intentional.
>
> > To solve these problems, create a new `drm_gpu_sched_stat`, called
> > DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
> > new status will indicate that the job must be reinserted into
> > `sched->pending_list`, and the hardware / driver will still complete that
> > job.
>
> Well, long story short, we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>
> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
The scheduler destroys the job, through free_job().
I think we have agreed that for now the scheduler is the party
responsible for the job lifetime.
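For reference, the callbacks in question, abridged from include/drm/gpu_scheduler.h (only the members relevant to this discussion are shown):

  struct drm_sched_backend_ops {
          /* Hands the job to the hardware and returns the HW fence the
           * scheduler will wait on (stored as s_fence->parent).
           */
          struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);

          /* Invoked when the job's timeout fires; with this series the return
           * value also covers the skip-the-reset case.
           */
          enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);

          /* Called by the scheduler once the job can be released. */
          void (*free_job)(struct drm_sched_job *sched_job);
  };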
>
> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
I don't see how this series introduces a problem?
The fact is that drivers are abusing the API by just firing jobs back
into the scheduler's job list. This series legalizes the abuse by
providing scheduler functionality for that.
IOW, the series improves the situation but does not add a *new*
problem. Even less so as drivers aren't forced to use the new status
code, but can continue having job completion race with timeout
handlers.
>
> Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
>
> This HW fence is then given as a parameter to the driver when we run into a timeout.
>
> This has the clear advantage that dma_fence objects have a well-defined lifetime and necessary state transitions. E.g. you can check at all times if the fence is signaled or not.
I'd say that centralizing job handling in the scheduler is a
prerequisite of what you suggest. You can't clean up anything while
certain drivers (sometimes without even locking!) just willy-nilly re-
add jobs to the pending_list.
P.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Maíra Canal <mcanal@igalia.com>
> > ---
> > drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
> > include/drm/gpu_scheduler.h | 3 +++
> > 2 files changed, 47 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> > {
> > struct drm_sched_job *job;
> >
> > - spin_lock(&sched->job_list_lock);
> > job = list_first_entry_or_null(&sched->pending_list,
> > struct drm_sched_job, list);
> > if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > __drm_sched_run_free_queue(sched);
> > +}
> > +
> > +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
> > +{
> > + spin_lock(&sched->job_list_lock);
> > + drm_sched_run_free_queue(sched);
> > spin_unlock(&sched->job_list_lock);
> > }
> >
> > @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> > spin_unlock(&sched->job_list_lock);
> > }
> >
> > +/**
> > + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
> > + * @sched: scheduler instance
> > + * @job: job to be reinserted on the pending list
> > + *
> > + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
> > + * hung and is making progress, the scheduler must reinsert the job back into
> > + * @sched->pending_list. Otherwise, the job and its resources won't be freed
> > + * through the &struct drm_sched_backend_ops.free_job callback.
> > + *
> > + * This function must be used in "false timeout" cases only.
> > + */
> > +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
> > + struct drm_sched_job *job)
> > +{
> > + spin_lock(&sched->job_list_lock);
> > + list_add(&job->list, &sched->pending_list);
> > +
> > + /* After reinserting the job, the scheduler enqueues the free-job work
> > + * again if ready. Otherwise, a signaled job could be added to the
> > + * pending list, but never freed.
> > + */
> > + drm_sched_run_free_queue(sched);
> > + spin_unlock(&sched->job_list_lock);
> > +}
> > +
> > static void drm_sched_job_timedout(struct work_struct *work)
> > {
> > struct drm_gpu_scheduler *sched;
> > @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > job->sched->ops->free_job(job);
> > sched->free_guilty = false;
> > }
> > +
> > + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
> > + drm_sched_job_reinsert_on_false_timeout(sched, job);
> > } else {
> > spin_unlock(&sched->job_list_lock);
> > }
> > @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > * This function is typically used for reset recovery (see the docu of
> > * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> > * scheduler teardown, i.e., before calling drm_sched_fini().
> > + *
> > + * As it's only used for reset recovery, drivers must not call this function
> > + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> > + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> > */
> > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > {
> > @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
> > * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> > * scheduler startup. The scheduler itself is fully operational after
> > * drm_sched_init() succeeded.
> > + *
> > + * As it's only used for reset recovery, drivers must not call this function
> > + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> > + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> > */
> > void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> > {
> > @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
> > if (job)
> > sched->ops->free_job(job);
> >
> > - drm_sched_run_free_queue(sched);
> > + drm_sched_run_free_queue_unlocked(sched);
> > drm_sched_run_job_queue(sched);
> > }
> >
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -393,11 +393,14 @@ struct drm_sched_job {
> > * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
> > * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> > + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
> > + * did not hang and is still running.
> > */
> > enum drm_gpu_sched_stat {
> > DRM_GPU_SCHED_STAT_NONE,
> > DRM_GPU_SCHED_STAT_RESET,
> > DRM_GPU_SCHED_STAT_ENODEV,
> > + DRM_GPU_SCHED_STAT_NO_HANG,
> > };
> >
> > /**
> >
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-11 13:22 ` Christian König
2025-07-11 13:37 ` Philipp Stanner
@ 2025-07-11 14:35 ` Maíra Canal
1 sibling, 0 replies; 23+ messages in thread
From: Maíra Canal @ 2025-07-11 14:35 UTC (permalink / raw)
To: Christian König, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Christian,
On 11/07/25 10:22, Christian König wrote:
>
>
> On 08.07.25 15:25, Maíra Canal wrote:
>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>> instead, a job just took unusually long (longer than the timeout) but is
>> still running, and there is, thus, no reason to reset the hardware. This
>> can occur in two scenarios:
>>
>> 1. The job is taking longer than the timeout, but the driver determined
>> through a GPU-specific mechanism that the hardware is still making
>> progress. Hence, the driver would like the scheduler to skip the
>> timeout and treat the job as still pending from then onward. This
>> happens in v3d, Etnaviv, and Xe.
>> 2. Timeout has fired before the free-job worker. Consequently, the
>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>> timed out.
>>
>> These two scenarios are problematic because the job was removed from the
>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
>> means that when the job finishes, it won't be freed by the scheduler
>> through `sched->ops->free_job()` - leading to a memory leak.
>
> Yeah, that is unfortunately intentional.
>
>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
>> new status will indicate that the job must be reinserted into
>> `sched->pending_list`, and the hardware / driver will still complete that
>> job.
>
> Well, long story short, we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>
> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
I do understand that we have a race between timing out and signaling the
job. However, I believe we are taking measures to mitigate issues.
We are re-adding the job to the pending list (after it was removed at
the beginning of the timeout handling, so we aren't racing with
drm_sched_get_finished_job()), and right after, the scheduler enqueues
the free-job work again if ready, which guarantees that if a signaled
job is added back to the pending list, it'll still be freed.
Now, the drivers have the option to bail out of a reset if the timeout
has fired before the free-job worker, and most importantly, without
leaks.
Apart from that, +1 to Philipp's answer. This series just incorporates
a common use-case into the scheduler's code (which we can work to
improve later on), and its use isn't mandatory for drivers.
Best Regards,
- Maíra
>
> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
>
> Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
>
> This HW fence is then given as a parameter to the driver when we run into a timeout.
>
> This has the clear advantage that dma_fence objects have a well-defined lifetime and necessary state transitions. E.g. you can check at all times if the fence is signaled or not.
>
> Regards,
> Christian.
>
>>
>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
>> include/drm/gpu_scheduler.h | 3 +++
>> 2 files changed, 47 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>> {
>> struct drm_sched_job *job;
>>
>> - spin_lock(&sched->job_list_lock);
>> job = list_first_entry_or_null(&sched->pending_list,
>> struct drm_sched_job, list);
>> if (job && dma_fence_is_signaled(&job->s_fence->finished))
>> __drm_sched_run_free_queue(sched);
>> +}
>> +
>> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
>> +{
>> + spin_lock(&sched->job_list_lock);
>> + drm_sched_run_free_queue(sched);
>> spin_unlock(&sched->job_list_lock);
>> }
>>
>> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>> spin_unlock(&sched->job_list_lock);
>> }
>>
>> +/**
>> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
>> + * @sched: scheduler instance
>> + * @job: job to be reinserted on the pending list
>> + *
>> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
>> + * hung and is making progress, the scheduler must reinsert the job back into
>> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
>> + * through the &struct drm_sched_backend_ops.free_job callback.
>> + *
>> + * This function must be used in "false timeout" cases only.
>> + */
>> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
>> + struct drm_sched_job *job)
>> +{
>> + spin_lock(&sched->job_list_lock);
>> + list_add(&job->list, &sched->pending_list);
>> +
>> + /* After reinserting the job, the scheduler enqueues the free-job work
>> + * again if ready. Otherwise, a signaled job could be added to the
>> + * pending list, but never freed.
>> + */
>> + drm_sched_run_free_queue(sched);
>> + spin_unlock(&sched->job_list_lock);
>> +}
>> +
>> static void drm_sched_job_timedout(struct work_struct *work)
>> {
>> struct drm_gpu_scheduler *sched;
>> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
>> job->sched->ops->free_job(job);
>> sched->free_guilty = false;
>> }
>> +
>> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
>> + drm_sched_job_reinsert_on_false_timeout(sched, job);
>> } else {
>> spin_unlock(&sched->job_list_lock);
>> }
>> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>> * This function is typically used for reset recovery (see the docu of
>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>> * scheduler teardown, i.e., before calling drm_sched_fini().
>> + *
>> + * As it's only used for reset recovery, drivers must not call this function
>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>> */
>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>> {
>> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>> * scheduler startup. The scheduler itself is fully operational after
>> * drm_sched_init() succeeded.
>> + *
>> + * As it's only used for reset recovery, drivers must not call this function
>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>> */
>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>> {
>> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
>> if (job)
>> sched->ops->free_job(job);
>>
>> - drm_sched_run_free_queue(sched);
>> + drm_sched_run_free_queue_unlocked(sched);
>> drm_sched_run_job_queue(sched);
>> }
>>
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -393,11 +393,14 @@ struct drm_sched_job {
>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
>> + * did not hang and is still running.
>> */
>> enum drm_gpu_sched_stat {
>> DRM_GPU_SCHED_STAT_NONE,
>> DRM_GPU_SCHED_STAT_RESET,
>> DRM_GPU_SCHED_STAT_ENODEV,
>> + DRM_GPU_SCHED_STAT_NO_HANG,
>> };
>>
>> /**
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-11 13:37 ` Philipp Stanner
@ 2025-07-11 15:20 ` Christian König
2025-07-11 17:23 ` Matthew Brost
2025-07-13 19:03 ` Maíra Canal
0 siblings, 2 replies; 23+ messages in thread
From: Christian König @ 2025-07-11 15:20 UTC (permalink / raw)
To: phasta, Maíra Canal, Matthew Brost, Danilo Krummrich,
Tvrtko Ursulin, Simona Vetter, David Airlie, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 11.07.25 15:37, Philipp Stanner wrote:
> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>
>>
>> On 08.07.25 15:25, Maíra Canal wrote:
>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>> instead, a job just took unusually long (longer than the timeout) but is
>>> still running, and there is, thus, no reason to reset the hardware. This
>>> can occur in two scenarios:
>>>
>>> 1. The job is taking longer than the timeout, but the driver determined
>>> through a GPU-specific mechanism that the hardware is still making
>>> progress. Hence, the driver would like the scheduler to skip the
>>> timeout and treat the job as still pending from then onward. This
>>> happens in v3d, Etnaviv, and Xe.
>>> 2. Timeout has fired before the free-job worker. Consequently, the
>>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>>> timed out.
>>>
>>> These two scenarios are problematic because the job was removed from the
>>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
>>> means that when the job finishes, it won't be freed by the scheduler
>>> through `sched->ops->free_job()` - leading to a memory leak.
>>
>> Yeah, that is unfortunately intentional.
>>
>>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
>>> new status will indicate that the job must be reinserted into
>>> `sched->pending_list`, and the hardware / driver will still complete that
>>> job.
>>
>> Well, long story short, we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>>
>> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
>
> The scheduler destroys the job, through free_job().
> I think we have agreed that for now the scheduler is the party
> responsible for the job lifetime.
That's what I strongly disagree on. The job is just a state bag between the submission and scheduling state of a submission.
For the scheduler the control starts when it is pushed into the entity and ends when run_job is called.
The real representation of the submission is the scheduler fence and that object has a perfectly defined lifetime, state and error handling.
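For readers without the header at hand, that object looks roughly like this (abridged sketch of struct drm_sched_fence; the exact field list may differ between kernel versions):

  struct drm_sched_fence {
          struct dma_fence scheduled;   /* signals when the job is picked up */
          struct dma_fence finished;    /* signals when the job completes */
          struct dma_fence *parent;     /* the HW fence returned by run_job() */
          /* ... */
  };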
>>
>> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
>
> I don't see how this series introduces a problem?
>
> The fact is that drivers are abusing the API by just firing jobs back
> into the scheduler's job list. This series legalizes the abuse by
> providing scheduler functionality for that.
>
> IOW, the series improves the situation but does not add a *new*
> problem. Even less so as drivers aren't forced to use the new status
> code, but can continue having job completion race with timeout
> handlers.
Maybe yes, but I'm really not sure about it.
Take a look at the git history of job destruction; we already had exactly that approach, removed it, and said that leaking memory is at least better than a use-after-free issue.
>>
>> Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
>>
>> This HW fence is then given as a parameter to the driver when we run into a timeout.
>>
>> This has the clear advantage that dma_fence objects have a well-defined lifetime and necessary state transitions. E.g. you can check at all times if the fence is signaled or not.
>
> I'd say that centralizing job handling in the scheduler is a
> prerequisite of what you suggest. You can't clean up anything while
> certain drivers (sometimes without even locking!) just willy-nilly re-
> add jobs to the pending_list.
The point is we should get completely rid of the pending list.
Then the whole question of removing or re-adding anything to the pending list doesn't appear in the first place.
The issue with that is that we need to change the timeout callback to take the fence and not the job. And that is a rather big change affecting all drivers using the scheduler.
Regards,
Christian.
>
>
> P.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>>> ---
>>> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
>>> include/drm/gpu_scheduler.h | 3 +++
>>> 2 files changed, 47 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>>> {
>>> struct drm_sched_job *job;
>>>
>>> - spin_lock(&sched->job_list_lock);
>>> job = list_first_entry_or_null(&sched->pending_list,
>>> struct drm_sched_job, list);
>>> if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>> __drm_sched_run_free_queue(sched);
>>> +}
>>> +
>>> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
>>> +{
>>> + spin_lock(&sched->job_list_lock);
>>> + drm_sched_run_free_queue(sched);
>>> spin_unlock(&sched->job_list_lock);
>>> }
>>>
>>> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>> spin_unlock(&sched->job_list_lock);
>>> }
>>>
>>> +/**
>>> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
>>> + * @sched: scheduler instance
>>> + * @job: job to be reinserted on the pending list
>>> + *
>>> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
>>> + * hung and is making progress, the scheduler must reinsert the job back into
>>> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
>>> + * through the &struct drm_sched_backend_ops.free_job callback.
>>> + *
>>> + * This function must be used in "false timeout" cases only.
>>> + */
>>> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
>>> + struct drm_sched_job *job)
>>> +{
>>> + spin_lock(&sched->job_list_lock);
>>> + list_add(&job->list, &sched->pending_list);
>>> +
>>> + /* After reinserting the job, the scheduler enqueues the free-job work
>>> + * again if ready. Otherwise, a signaled job could be added to the
>>> + * pending list, but never freed.
>>> + */
>>> + drm_sched_run_free_queue(sched);
>>> + spin_unlock(&sched->job_list_lock);
>>> +}
>>> +
>>> static void drm_sched_job_timedout(struct work_struct *work)
>>> {
>>> struct drm_gpu_scheduler *sched;
>>> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>> job->sched->ops->free_job(job);
>>> sched->free_guilty = false;
>>> }
>>> +
>>> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
>>> + drm_sched_job_reinsert_on_false_timeout(sched, job);
>>> } else {
>>> spin_unlock(&sched->job_list_lock);
>>> }
>>> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>> * This function is typically used for reset recovery (see the docu of
>>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>>> * scheduler teardown, i.e., before calling drm_sched_fini().
>>> + *
>>> + * As it's only used for reset recovery, drivers must not call this function
>>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>>> */
>>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>> {
>>> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>>> * scheduler startup. The scheduler itself is fully operational after
>>> * drm_sched_init() succeeded.
>>> + *
>>> + * As it's only used for reset recovery, drivers must not call this function
>>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>>> */
>>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>> {
>>> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
>>> if (job)
>>> sched->ops->free_job(job);
>>>
>>> - drm_sched_run_free_queue(sched);
>>> + drm_sched_run_free_queue_unlocked(sched);
>>> drm_sched_run_job_queue(sched);
>>> }
>>>
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -393,11 +393,14 @@ struct drm_sched_job {
>>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>>> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
>>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>>> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
>>> + * did not hang and is still running.
>>> */
>>> enum drm_gpu_sched_stat {
>>> DRM_GPU_SCHED_STAT_NONE,
>>> DRM_GPU_SCHED_STAT_RESET,
>>> DRM_GPU_SCHED_STAT_ENODEV,
>>> + DRM_GPU_SCHED_STAT_NO_HANG,
>>> };
>>>
>>> /**
>>>
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-11 15:20 ` Christian König
@ 2025-07-11 17:23 ` Matthew Brost
2025-07-14 9:10 ` Christian König
2025-07-13 19:03 ` Maíra Canal
1 sibling, 1 reply; 23+ messages in thread
From: Matthew Brost @ 2025-07-11 17:23 UTC (permalink / raw)
To: Christian König
Cc: phasta, Maíra Canal, Danilo Krummrich, Tvrtko Ursulin,
Simona Vetter, David Airlie, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau, kernel-dev, dri-devel, etnaviv,
intel-xe
On Fri, Jul 11, 2025 at 05:20:49PM +0200, Christian König wrote:
> On 11.07.25 15:37, Philipp Stanner wrote:
> > On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
> >>
> >>
> >> On 08.07.25 15:25, Maíra Canal wrote:
> >>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> >>> instead, a job just took unusually long (longer than the timeout) but is
> >>> still running, and there is, thus, no reason to reset the hardware. This
> >>> can occur in two scenarios:
> >>>
> >>> 1. The job is taking longer than the timeout, but the driver determined
> >>> through a GPU-specific mechanism that the hardware is still making
> >>> progress. Hence, the driver would like the scheduler to skip the
> >>> timeout and treat the job as still pending from then onward. This
> >>> happens in v3d, Etnaviv, and Xe.
> >>> 2. Timeout has fired before the free-job worker. Consequently, the
> >>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
> >>> timed out.
> >>>
> >>> These two scenarios are problematic because the job was removed from the
> >>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> >>> means that when the job finishes, it won't be freed by the scheduler
> >>> through `sched->ops->free_job()` - leading to a memory leak.
> >>
> >> Yeah, that is unfortunately intentional.
> >>
> >>> To solve these problems, create a new `drm_gpu_sched_stat`, called
> >>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
> >>> new status will indicate that the job must be reinserted into
> >>> `sched->pending_list`, and the hardware / driver will still complete that
> >>> job.
> >>
> >> Well, long story short, we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
> >>
> >> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
> >
> > The scheduler destroys the job, through free_job().
> > I think we have agreed that for now the scheduler is the party
> > responsible for the job lifetime.
>
> That's what I strongly disagree on. The job is just a state bag between the submission and scheduling state of a submission.
>
See below. free_job is called after the fence signals, and drivers have
now built assumptions around this, which would be hard to untangle.
Sure, the scheduler could have been designed to free the job immediately
after calling run_job(), but it wasn’t. Honestly, I kind of agree with
freeing the job immediately after running it—but that’s not the world we
live in, and we don’t have a time machine to fix it. I’d rather document
the current rules around job lifetime and look for ways to improve it.
Changing the job lifetime rules would require broad community buy-in,
which I doubt everyone would agree to.
> For the scheduler the control starts when it is pushed into the entity and ends when run_job is called.
>
> The real representation of the submission is the scheduler fence and that object has a perfectly defined lifetime, state and error handling.
>
> >>
> >> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
> >
> > I don't see how this series introduces a problem?
> >
> > The fact is that drivers are abusing the API by just firing jobs back
> > into the scheduler's job list. This series legalizes the abuse by
> > providing scheduler functionality for that.
> >
> > IOW, the series improves the situation but does not add a *new*
> > problem. Even less so as drivers aren't forced to use the new status
> > code, but can continue having job completion race with timeout
> > handlers.
>
> Maybe yes, but I'm really not sure about it.
>
> Take a look at the git history of job destruction; we already had exactly that approach, removed it, and said that leaking memory is at least better than a use-after-free issue.
>
> >>
> >> Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
> >>
> >> This HW fence is then given as a parameter to the driver when we run into a timeout.
> >>
> >> This has the clear advantage that dma_fence objects have a well-defined lifetime and necessary state transitions. E.g. you can check at all times if the fence is signaled or not.
> >
> > I'd say that centralizing job handling in the scheduler is a
> > prerequisite of what you suggest. You can't clean up anything while
> > certain drivers (sometimes without even locking!) just willy-nilly re-
> > add jobs to the pending_list.
>
> The point is we should get completely rid of the pending list.
>
> Then the whole question of removing or re-adding anything to the pending list doesn't appear in the first place.
>
> The issue with that is that we need to change the timeout callback to take the fence and not the job. And that is a rather big change affecting all drivers using the scheduler.
>
I agree with the idea that the pending list should generally be a list
of pending fences, but once you start considering the implications, I'm
not really sure it works—or at least, not easily.
Most drivers these days decouple the fence and the job (i.e., the fence
isn't embedded in the job), so there's no inherent way to go from a
fence → job. In Xe, the job’s fence can be one of a variety of different
fences, depending on the submission type. To handle fence timeouts, we
don’t necessarily need the job itself, but we do need the driver-side
software queue. In Xe, this would be the scheduler, given the 1:1
relationship—so we could likely make it work. For drivers without a 1:1
relationship, it's unclear how this would be resolved, though perhaps
the hardware queue would be sufficient in that case.
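As a rough illustration of that decoupling (hypothetical driver structures, not taken from Xe or any other driver): the job holds a one-way reference to the HW fence it produced, while the fence carries no back-pointer to the job, so a fence-only pending list would leave the timeout path without a direct route back to the job or the driver-side queue behind it.

  struct my_hw_fence {
          struct dma_fence base;          /* container_of() yields my_hw_fence only */
          spinlock_t lock;
  };

  struct my_driver_job {
          struct drm_sched_job base;
          struct my_hw_fence *hw_fence;   /* job -> fence, no fence -> job link */
  };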
It also follows that the job would be freed immediately after run_job()
if we don’t maintain a pending list of jobs, right? That opens a huge
can of worms, especially considering how Xe depends on the job being
freed only after the fence signals. Our job reference counts tie into
power management, hold a reference to the software queue, which in turn
reference-counts the VM, etc. I suspect other drivers do similar things.
It seems far simpler to just keep the job around until its fence
signals.
TL;DR: This seems like quite a lot of trouble.
Matt
> Regards,
> Christian.
>
> >
> >
> > P.
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
> >>> ---
> >>> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
> >>> include/drm/gpu_scheduler.h | 3 +++
> >>> 2 files changed, 47 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> >>> {
> >>> struct drm_sched_job *job;
> >>>
> >>> - spin_lock(&sched->job_list_lock);
> >>> job = list_first_entry_or_null(&sched->pending_list,
> >>> struct drm_sched_job, list);
> >>> if (job && dma_fence_is_signaled(&job->s_fence->finished))
> >>> __drm_sched_run_free_queue(sched);
> >>> +}
> >>> +
> >>> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
> >>> +{
> >>> + spin_lock(&sched->job_list_lock);
> >>> + drm_sched_run_free_queue(sched);
> >>> spin_unlock(&sched->job_list_lock);
> >>> }
> >>>
> >>> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >>> spin_unlock(&sched->job_list_lock);
> >>> }
> >>>
> >>> +/**
> >>> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
> >>> + * @sched: scheduler instance
> >>> + * @job: job to be reinserted on the pending list
> >>> + *
> >>> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
> >>> + * hung and is making progress, the scheduler must reinsert the job back into
> >>> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
> >>> + * through the &struct drm_sched_backend_ops.free_job callback.
> >>> + *
> >>> + * This function must be used in "false timeout" cases only.
> >>> + */
> >>> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
> >>> + struct drm_sched_job *job)
> >>> +{
> >>> + spin_lock(&sched->job_list_lock);
> >>> + list_add(&job->list, &sched->pending_list);
> >>> +
> >>> + /* After reinserting the job, the scheduler enqueues the free-job work
> >>> + * again if ready. Otherwise, a signaled job could be added to the
> >>> + * pending list, but never freed.
> >>> + */
> >>> + drm_sched_run_free_queue(sched);
> >>> + spin_unlock(&sched->job_list_lock);
> >>> +}
> >>> +
> >>> static void drm_sched_job_timedout(struct work_struct *work)
> >>> {
> >>> struct drm_gpu_scheduler *sched;
> >>> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >>> job->sched->ops->free_job(job);
> >>> sched->free_guilty = false;
> >>> }
> >>> +
> >>> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
> >>> + drm_sched_job_reinsert_on_false_timeout(sched, job);
> >>> } else {
> >>> spin_unlock(&sched->job_list_lock);
> >>> }
> >>> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >>> * This function is typically used for reset recovery (see the docu of
> >>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> >>> * scheduler teardown, i.e., before calling drm_sched_fini().
> >>> + *
> >>> + * As it's only used for reset recovery, drivers must not call this function
> >>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> >>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> >>> */
> >>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >>> {
> >>> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
> >>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> >>> * scheduler startup. The scheduler itself is fully operational after
> >>> * drm_sched_init() succeeded.
> >>> + *
> >>> + * As it's only used for reset recovery, drivers must not call this function
> >>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> >>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> >>> */
> >>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> >>> {
> >>> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
> >>> if (job)
> >>> sched->ops->free_job(job);
> >>>
> >>> - drm_sched_run_free_queue(sched);
> >>> + drm_sched_run_free_queue_unlocked(sched);
> >>> drm_sched_run_job_queue(sched);
> >>> }
> >>>
> >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
> >>> --- a/include/drm/gpu_scheduler.h
> >>> +++ b/include/drm/gpu_scheduler.h
> >>> @@ -393,11 +393,14 @@ struct drm_sched_job {
> >>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> >>> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
> >>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> >>> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
> >>> + * did not hang and is still running.
> >>> */
> >>> enum drm_gpu_sched_stat {
> >>> DRM_GPU_SCHED_STAT_NONE,
> >>> DRM_GPU_SCHED_STAT_RESET,
> >>> DRM_GPU_SCHED_STAT_ENODEV,
> >>> + DRM_GPU_SCHED_STAT_NO_HANG,
> >>> };
> >>>
> >>> /**
> >>>
> >>
> >
>
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-11 15:20 ` Christian König
2025-07-11 17:23 ` Matthew Brost
@ 2025-07-13 19:03 ` Maíra Canal
2025-07-14 9:23 ` Christian König
1 sibling, 1 reply; 23+ messages in thread
From: Maíra Canal @ 2025-07-13 19:03 UTC (permalink / raw)
To: Christian König, phasta, Matthew Brost, Danilo Krummrich,
Tvrtko Ursulin, Simona Vetter, David Airlie, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
Hi Christian,
On 11/07/25 12:20, Christian König wrote:
> On 11.07.25 15:37, Philipp Stanner wrote:
>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>
>>>
>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>>> instead, a job just took unusually long (longer than the timeout) but is
>>>> still running, and there is, thus, no reason to reset the hardware. This
>>>> can occur in two scenarios:
>>>>
>>>> 1. The job is taking longer than the timeout, but the driver determined
>>>> through a GPU-specific mechanism that the hardware is still making
>>>> progress. Hence, the driver would like the scheduler to skip the
>>>> timeout and treat the job as still pending from then onward. This
>>>> happens in v3d, Etnaviv, and Xe.
>>>> 2. Timeout has fired before the free-job worker. Consequently, the
>>>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>>>> timed out.
>>>>
>>>> These two scenarios are problematic because the job was removed from the
>>>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
>>>> means that when the job finishes, it won't be freed by the scheduler
> >>>> through `sched->ops->free_job()` - leading to a memory leak.
>>>
>>> Yeah, that is unfortunately intentional.
>>>
>>>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
>>>> new status will indicate that the job must be reinserted into
>>>> `sched->pending_list`, and the hardware / driver will still complete that
>>>> job.
>>>
>>> Well long story short we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>>>
> >>> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
>>
>> The scheduler destroys the job, through free_job().
>> I think we have agreed that for now the scheduler is the party
>> responsible for the job lifetime.
>
> That's what I strongly disagree on. The job is just a state bag between the submission and scheduling state of a submission.
>
> For the scheduler the control starts when it is pushed into the entity and ends when run_job is called.
>
> The real representation of the submission is the scheduler fence and that object has a perfectly defined lifetime, state and error handling.
>
>>>
>>> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
>>
>> I don't see how this series introduces a problem?
>>
>> The fact is that drivers are abusing the API by just firing jobs back
>> into the scheduler's job list. This series legalizes the abuse by
>> providing scheduler functionality for that.
>>
>> IOW, the series improves the situation but does not add a *new*
> > problem. Even less so as drivers aren't forced to use the new status
>> code, but can continue having job completion race with timeout
>> handlers.
>
> Maybe yes, but I'm really not sure about it.
>
> Take a look at the git history on job destruction: we already had exactly that approach, removed it, and said that leaking memory is at least better than a use-after-free issue.
>
If the job was removed from the pending list at the beginning of the
timeout handling and drm_sched_get_finished_job() fetches jobs from the
pending list, how can we end up with a use-after-free issue?
Best Regards,
- Maíra
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-11 17:23 ` Matthew Brost
@ 2025-07-14 9:10 ` Christian König
0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2025-07-14 9:10 UTC (permalink / raw)
To: Matthew Brost
Cc: phasta, Maíra Canal, Danilo Krummrich, Tvrtko Ursulin,
Simona Vetter, David Airlie, Melissa Wen, Lucas Stach,
Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau, kernel-dev, dri-devel, etnaviv,
intel-xe
On 11.07.25 19:23, Matthew Brost wrote:
> On Fri, Jul 11, 2025 at 05:20:49PM +0200, Christian König wrote:
>> On 11.07.25 15:37, Philipp Stanner wrote:
>>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>>
>>>>
>>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>>>> instead, a job just took unusually long (longer than the timeout) but is
>>>>> still running, and there is, thus, no reason to reset the hardware. This
>>>>> can occur in two scenarios:
>>>>>
>>>>> 1. The job is taking longer than the timeout, but the driver determined
>>>>> through a GPU-specific mechanism that the hardware is still making
>>>>> progress. Hence, the driver would like the scheduler to skip the
>>>>> timeout and treat the job as still pending from then onward. This
>>>>> happens in v3d, Etnaviv, and Xe.
>>>>> 2. Timeout has fired before the free-job worker. Consequently, the
>>>>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>>>>> timed out.
>>>>>
>>>>> These two scenarios are problematic because the job was removed from the
>>>>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
>>>>> means that when the job finishes, it won't be freed by the scheduler
> >>>>> through `sched->ops->free_job()` - leading to a memory leak.
>>>>
>>>> Yeah, that is unfortunately intentional.
>>>>
>>>>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>>>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
>>>>> new status will indicate that the job must be reinserted into
>>>>> `sched->pending_list`, and the hardware / driver will still complete that
>>>>> job.
>>>>
>>>> Well long story short we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>>>>
> >>>> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
>>>
>>> The scheduler destroys the job, through free_job().
>>> I think we have agreed that for now the scheduler is the party
>>> responsible for the job lifetime.
>>
>> That's what I strongly disagree on. The job is just a state bag between the submission and scheduling state of a submission.
>>
>
> See below. free_job is called after the fence signals, and drivers have
> now built assumptions around this, which would be hard to untangle.
> Sure, the scheduler could have been designed to free the job immediately
> after calling run_job(), but it wasn’t.
Exactly, that's the point! The scheduler *was* designed around the idea of immediately freeing the job when it starts to run.
The original idea was that the scheduler just works with a void* as the job representation and converts that into the HW fence by calling run_job.
This avoids tons of problems, ranging from the fact that you can't allocate memory for the HW fence while run_job is called all the way to not having a properly defined job lifetime.
I screamed out quite loudly when people started to change this, because it was absolutely foreseeable that this would go boom, but I just wasn't the maintainer of that stuff at that point.
> Honestly, I kind of agree with
> freeing the job immediately after running it—but that’s not the world we
> live in, and we don’t have a time machine to fix it.
Yeah, that's unfortunately true.
> I’d rather document
> the current rules around job lifetime and look for ways to improve it.
Exactly, that's what I'm trying to do here.
See, we have spent the last 8 years iterating over the same problem over and over again, and the current solution just barely works. Take a look at the git history to see how often we have tried to get this to work properly.
And yes, it's well known that it leaks the job when the driver doesn't do a reset, and, as Maíra absolutely correctly pointed out, re-inserting the job during the reset is just a completely broken design.
It's just that everybody working on that has given up at some point.
> Changing the job lifetime rules would require broad community buy-in,
> which I doubt everyone would agree to.
Well, changing the parameter of the timedout callback from the job to the fence should be relatively easy to do. That should immediately fix the issue Maíra is working on.
Then potentially changing the job lifetime in the scheduler can be a different topic.
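Just to sketch what that could look like (the hook name and the helper
below are made up, this is not an existing API):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/*
 * Hypothetical drm_sched_backend_ops member replacing timedout_job():
 *
 *      enum drm_gpu_sched_stat (*timedout_fence)(struct drm_sched_fence *s_fence);
 *
 * The scheduler would hand the driver the scheduler fence instead of
 * the job; the fence has a well-defined lifetime, so checking its
 * state is always safe.
 */

static enum drm_gpu_sched_stat my_reset_queue(struct drm_sched_fence *s_fence)
{
        /* Driver-specific reset would go here; placeholder only. */
        return DRM_GPU_SCHED_STAT_RESET;
}

static enum drm_gpu_sched_stat my_timedout_fence(struct drm_sched_fence *s_fence)
{
        /* The submission already finished, nothing to reset. The status
         * used here is the one added by this series.
         */
        if (dma_fence_is_signaled(&s_fence->finished))
                return DRM_GPU_SCHED_STAT_NO_HANG;

        /* How a driver maps the fence back to its HW queue is up to the
         * driver (e.g. via the fence context or a container struct).
         */
        return my_reset_queue(s_fence);
}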
>> For the scheduler the control starts when it is pushed into the entity and ends when run_job is called.
>>
>> The real representation of the submission is the scheduler fence and that object has a perfectly defined lifetime, state and error handling.
>>
>>>>
>>>> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
>>>
>>> I don't see how this series introduces a problem?
>>>
>>> The fact is that drivers are abusing the API by just firing jobs back
>>> into the scheduler's job list. This series legalizes the abuse by
>>> providing scheduler functionality for that.
>>>
>>> IOW, the series improves the situation but does not add a *new*
> >>> problem. Even less so as drivers aren't forced to use the new status
>>> code, but can continue having job completion race with timeout
>>> handlers.
>>
>> Maybe yes, but I'm really not sure about it.
>>
> >> Take a look at the git history on job destruction: we already had exactly that approach, removed it, and said that leaking memory is at least better than a use-after-free issue.
>>
>>>>
>>>> Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
>>>>
>>>> This HW fence is then given as a parameter to the driver when we run into a timeout.
>>>>
> >>> This has the clear advantage that dma_fence objects have a well-defined lifetime and well-defined state transitions. E.g. you can check at all times if the fence is signaled or not.
>>>
>>> I'd say that centralizing job handling in the scheduler is a
>>> prerequisite of what you suggest. You can't clean up anything while
>>> certain drivers (sometimes without even locking!) just willy-nilly re-
>>> add jobs to the pending_list.
>>
>> The point is we should get completely rid of the pending list.
>>
>> Then the whole question of removing or re-adding anything to the pending list doesn't appear in the first place.
>>
>> The issue with that is that we need to change the timeout callback to take the fence and not the job. And that is a rather big change affecting all drivers using the scheduler.
>>
>
> I agree with the idea that the pending list should generally be a list
> of pending fences, but once you start considering the implications, I'm
> not really sure it works—or at least, not easily.
>
> Most drivers these days decouple the fence and the job (i.e., the fence
> isn't embedded in the job), so there's no inherent way to go from a
> fence → job. In Xe, the job’s fence can be one of a variety of different
> fences, depending on the submission type. To handle fence timeouts, we
> don’t necessarily need the job itself, but we do need the driver-side
> software queue. In Xe, this would be the scheduler, given the 1:1
> relationship—so we could likely make it work. For drivers without a 1:1
> relationship, it's unclear how this would be resolved, though perhaps
> the hardware queue would be sufficient in that case.
>
> It also follows that the job would be freed immediately after run_job()
> if we don’t maintain a pending list of jobs, right? That opens a huge
> can of worms, especially considering how Xe depends on the job being
> freed only after the fence signals. Our job reference counts tie into
> power management, hold a reference to the software queue, which in turn
> reference-counts the VM, etc. I suspect other drivers do similar things.
> It seems far simpler to just keep the job around until its fence
> signals.
>
> TL;DR: This seems like quite a lot of trouble.
For a start, my idea is to have a pointer to the job in the scheduler fence and then track the pending fences instead of the pending jobs.
As a next step, give the timedout callback the scheduler fence (which at that point has both the HW fence pointer and the job) instead of the job, and stop messing with the pending list altogether during a timeout.
After that, we could move the job pointer from the scheduler fence into the driver-specific HW fence.
Then we add a DMA-buf utility object which starts a work item when a dma_fence signals (I wanted to do that anyway because it is a rather common pattern).
This utility object would then replace the free_job callback from the scheduler, so that drivers can basically decide themselves when they want their job object freed.
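A very rough sketch of such a utility, just to make the idea concrete
(all names below are made up, nothing like this exists yet):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* Runs a work item once a dma_fence signals (or immediately if it
 * already has). A driver could embed this in its job and free the job
 * from the work function, making the scheduler's free_job callback
 * unnecessary.
 */
struct fence_signal_work {
        struct dma_fence_cb cb;
        struct work_struct work;
};

static void fence_signal_work_cb(struct dma_fence *fence,
                                 struct dma_fence_cb *cb)
{
        struct fence_signal_work *fsw =
                container_of(cb, struct fence_signal_work, cb);

        schedule_work(&fsw->work);
}

static void fence_signal_work_queue(struct dma_fence *fence,
                                    struct fence_signal_work *fsw,
                                    work_func_t func)
{
        INIT_WORK(&fsw->work, func);

        /* dma_fence_add_callback() returns an error if the fence has
         * already signaled; in that case the callback is never invoked,
         * so schedule the work directly.
         */
        if (dma_fence_add_callback(fence, &fsw->cb, fence_signal_work_cb))
                schedule_work(&fsw->work);
}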
Regards,
Christian.
>
> Matt
>
>> Regards,
>> Christian.
>>
>>>
>>>
>>> P.
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Signed-off-by: Maíra Canal <mcanal@igalia.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
>>>>> include/drm/gpu_scheduler.h | 3 +++
>>>>> 2 files changed, 47 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>>>>> {
>>>>> struct drm_sched_job *job;
>>>>>
>>>>> - spin_lock(&sched->job_list_lock);
>>>>> job = list_first_entry_or_null(&sched->pending_list,
>>>>> struct drm_sched_job, list);
>>>>> if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>>>> __drm_sched_run_free_queue(sched);
>>>>> +}
>>>>> +
>>>>> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> + spin_lock(&sched->job_list_lock);
>>>>> + drm_sched_run_free_queue(sched);
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> }
>>>>>
>>>>> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> }
>>>>>
>>>>> +/**
>>>>> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
>>>>> + * @sched: scheduler instance
>>>>> + * @job: job to be reinserted on the pending list
>>>>> + *
>>>>> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
>>>>> + * hung and is making progress, the scheduler must reinsert the job back into
>>>>> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
>>>>> + * through the &struct drm_sched_backend_ops.free_job callback.
>>>>> + *
>>>>> + * This function must be used in "false timeout" cases only.
>>>>> + */
>>>>> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
>>>>> + struct drm_sched_job *job)
>>>>> +{
>>>>> + spin_lock(&sched->job_list_lock);
>>>>> + list_add(&job->list, &sched->pending_list);
>>>>> +
>>>>> + /* After reinserting the job, the scheduler enqueues the free-job work
>>>>> + * again if ready. Otherwise, a signaled job could be added to the
>>>>> + * pending list, but never freed.
>>>>> + */
>>>>> + drm_sched_run_free_queue(sched);
>>>>> + spin_unlock(&sched->job_list_lock);
>>>>> +}
>>>>> +
>>>>> static void drm_sched_job_timedout(struct work_struct *work)
>>>>> {
>>>>> struct drm_gpu_scheduler *sched;
>>>>> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>> job->sched->ops->free_job(job);
>>>>> sched->free_guilty = false;
>>>>> }
>>>>> +
>>>>> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
>>>>> + drm_sched_job_reinsert_on_false_timeout(sched, job);
>>>>> } else {
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> }
>>>>> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>> * This function is typically used for reset recovery (see the docu of
>>>>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>>>>> * scheduler teardown, i.e., before calling drm_sched_fini().
>>>>> + *
>>>>> + * As it's only used for reset recovery, drivers must not call this function
>>>>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>>>>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>>>>> */
>>>>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>> {
>>>>> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>>>>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>>>>> * scheduler startup. The scheduler itself is fully operational after
>>>>> * drm_sched_init() succeeded.
>>>>> + *
>>>>> + * As it's only used for reset recovery, drivers must not call this function
>>>>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>>>>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>>>>> */
>>>>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>>>> {
>>>>> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
>>>>> if (job)
>>>>> sched->ops->free_job(job);
>>>>>
>>>>> - drm_sched_run_free_queue(sched);
>>>>> + drm_sched_run_free_queue_unlocked(sched);
>>>>> drm_sched_run_job_queue(sched);
>>>>> }
>>>>>
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -393,11 +393,14 @@ struct drm_sched_job {
>>>>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>>>>> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
>>>>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>>>>> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
>>>>> + * did not hang and is still running.
>>>>> */
>>>>> enum drm_gpu_sched_stat {
>>>>> DRM_GPU_SCHED_STAT_NONE,
>>>>> DRM_GPU_SCHED_STAT_RESET,
>>>>> DRM_GPU_SCHED_STAT_ENODEV,
>>>>> + DRM_GPU_SCHED_STAT_NO_HANG,
>>>>> };
>>>>>
>>>>> /**
>>>>>
>>>>
>>>
>>
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-13 19:03 ` Maíra Canal
@ 2025-07-14 9:23 ` Christian König
2025-07-14 10:16 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2025-07-14 9:23 UTC (permalink / raw)
To: Maíra Canal, phasta, Matthew Brost, Danilo Krummrich,
Tvrtko Ursulin, Simona Vetter, David Airlie, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 13.07.25 21:03, Maíra Canal wrote:
> Hi Christian,
>
> On 11/07/25 12:20, Christian König wrote:
>> On 11.07.25 15:37, Philipp Stanner wrote:
>>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>>
>>>>
>>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>>>> instead, a job just took unusually long (longer than the timeout) but is
>>>>> still running, and there is, thus, no reason to reset the hardware. This
>>>>> can occur in two scenarios:
>>>>>
>>>>> 1. The job is taking longer than the timeout, but the driver determined
>>>>> through a GPU-specific mechanism that the hardware is still making
>>>>> progress. Hence, the driver would like the scheduler to skip the
>>>>> timeout and treat the job as still pending from then onward. This
>>>>> happens in v3d, Etnaviv, and Xe.
>>>>> 2. Timeout has fired before the free-job worker. Consequently, the
>>>>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>>>>> timed out.
>>>>>
>>>>> These two scenarios are problematic because the job was removed from the
>>>>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
>>>>> means that when the job finishes, it won't be freed by the scheduler
> >>>>> through `sched->ops->free_job()` - leading to a memory leak.
>>>>
>>>> Yeah, that is unfortunately intentional.
>>>>
>>>>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>>>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
>>>>> new status will indicate that the job must be reinserted into
>>>>> `sched->pending_list`, and the hardware / driver will still complete that
>>>>> job.
>>>>
>>>> Well long story short we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>>>>
> >>>> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we already discussed more than once.
>>>
>>> The scheduler destroys the job, through free_job().
>>> I think we have agreed that for now the scheduler is the party
>>> responsible for the job lifetime.
>>
>> That's what I strongly disagree on. The job is just a state bag between the submission and scheduling state of a submission.
>>
>> For the scheduler the control starts when it is pushed into the entity and ends when run_job is called.
>>
>> The real representation of the submission is the scheduler fence and that object has a perfectly defined lifetime, state and error handling.
>>
>>>>
>>>> If you want to fix this I think the correct approach is to completely drop tracking jobs in the scheduler at all.
>>>
>>> I don't see how this series introduces a problem?
>>>
>>> The fact is that drivers are abusing the API by just firing jobs back
>>> into the scheduler's job list. This series legalizes the abuse by
>>> providing scheduler functionality for that.
>>>
>>> IOW, the series improves the situation but does not add a *new*
> >>> problem. Even less so as drivers aren't forced to use the new
>>> code, but can continue having job completion race with timeout
>>> handlers.
>>
>> Maybe yes, but I'm really not sure about it.
>>
> >> Take a look at the git history on job destruction: we already had exactly that approach, removed it, and said that leaking memory is at least better than a use-after-free issue.
>>
>
> If the job was removed from the pending list at the beginning of the
> timeout handling and drm_sched_get_finished_job() fetches jobs from the
> pending list, how can we end up with a use-after-free issue?
By adding it back into the list after the timeout handling completed.
We had exactly that before we came up with the horrible design of adding it back to the pending list in drm_sched_stop() and the free_guilty flag.
It could be that your approach now works better, but I'm really not sure about that.
Regards,
Christian.
>
> Best Regards,
> - Maíra
>
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-14 9:23 ` Christian König
@ 2025-07-14 10:16 ` Philipp Stanner
2025-07-14 11:46 ` Christian König
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-07-14 10:16 UTC (permalink / raw)
To: Christian König, Maíra Canal, phasta, Matthew Brost,
Danilo Krummrich, Tvrtko Ursulin, Simona Vetter, David Airlie,
Melissa Wen, Lucas Stach, Russell King, Christian Gmeiner,
Lucas De Marchi, Thomas Hellström, Rodrigo Vivi,
Boris Brezillon, Rob Herring, Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On Mon, 2025-07-14 at 11:23 +0200, Christian König wrote:
> On 13.07.25 21:03, Maíra Canal wrote:
> > Hi Christian,
> >
> > On 11/07/25 12:20, Christian König wrote:
> > > On 11.07.25 15:37, Philipp Stanner wrote:
> > > > On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
> > > > >
> > > > >
> > > > > On 08.07.25 15:25, Maíra Canal wrote:
> > > > > > When the DRM scheduler times out, it's possible that the
> > > > > > GPU isn't hung;
> > > > > > instead, a job just took unusually long (longer than the
> > > > > > timeout) but is
> > > > > > still running, and there is, thus, no reason to reset the
> > > > > > hardware. This
> > > > > > can occur in two scenarios:
> > > > > >
> > > > > > 1. The job is taking longer than the timeout, but the
> > > > > > driver determined
> > > > > > through a GPU-specific mechanism that the hardware is
> > > > > > still making
> > > > > > progress. Hence, the driver would like the scheduler
> > > > > > to skip the
> > > > > > timeout and treat the job as still pending from then
> > > > > > onward. This
> > > > > > happens in v3d, Etnaviv, and Xe.
> > > > > > 2. Timeout has fired before the free-job worker.
> > > > > > Consequently, the
> > > > > > scheduler calls `sched->ops->timedout_job()` for a
> > > > > > job that isn't
> > > > > > timed out.
> > > > > >
> > > > > > These two scenarios are problematic because the job was
> > > > > > removed from the
> > > > > > `sched->pending_list` before calling `sched->ops-
> > > > > > >timedout_job()`, which
> > > > > > means that when the job finishes, it won't be freed by the
> > > > > > scheduler
> > > > > > through `sched->ops->free_job()` - leading to a memory leak.
> > > > >
> > > > > Yeah, that is unfortunately intentional.
> > > > >
> > > > > > To solve these problems, create a new `drm_gpu_sched_stat`,
> > > > > > called
> > > > > > DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip
> > > > > > the reset. The
> > > > > > new status will indicate that the job must be reinserted
> > > > > > into
> > > > > > `sched->pending_list`, and the hardware / driver will still
> > > > > > complete that
> > > > > > job.
> > > > >
> > > > > Well long story short we have already tried this and the
> > > > > whole approach doesn't work correctly in all cases. See the
> > > > > git history around how we used to destroy the jobs.
> > > > >
> > > > > The basic problem is that you can always race between timing
> > > > > out and signaling/destroying the job. This is the
> > > > > long-lasting job lifetime issue we already discussed more than
> > > > > once.
> > > >
> > > > The scheduler destroys the job, through free_job().
> > > > I think we have agreed that for now the scheduler is the party
> > > > responsible for the job lifetime.
> > >
> > > That's what I strongly disagree on. The job is just a state bag
> > > between the submission and scheduling state of a submission.
> > >
> > > For the scheduler the control starts when it is pushed into the
> > > entity and ends when run_job is called.
> > >
> > > The real representation of the submission is the scheduler fence
> > > and that object has a perfectly defined lifetime, state and error
> > > handling.
> > >
> > > > >
> > > > > If you want to fix this I think the correct approach is to
> > > > > completely drop tracking jobs in the scheduler at all.
> > > >
> > > > I don't see how this series introduces a problem?
> > > >
> > > > The fact is that drivers are abusing the API by just firing
> > > > jobs back
> > > > into the scheduler's job list. This series legalizes the abuse
> > > > by
> > > > providing scheduler functionality for that.
> > > >
> > > > IOW, the series improves the situation but does not add a *new*
> > > > problem. Even less so as drivers aren't forced to use the new
> > > > status
> > > > code, but can continue having job completion race with timeout
> > > > handlers.
> > >
> > > Maybe yes, but I'm really not sure about it.
> > >
> > > Take a look at the git history on job destruction: we already had
> > > exactly that approach, removed it, and said that leaking memory is
> > > at least better than a use-after-free issue.
> > >
> >
> > If the job was removed from the pending list at the beginning of the
> > timeout handling and drm_sched_get_finished_job() fetches jobs from
> > the pending list, how can we end up with a use-after-free issue?
>
> By adding it back into the list after the timeout handling completed.
>
> We had exactly that before we came up with the horrible design to add
> it back to the pending list in drm_sched_stop() and the free_guilty
> flag.
>
> Could be that your approach now works better, but I'm really not sure
> about that.
This isn't Maíra's approach; this is an approach that is already
established practice in DRM. Have you looked at the code? There isn't
that much magic happening there.
The job gets added back to pending_list. For the scheduler, this is
like a completely new job, with the timeout handler running again and
the other work items picking the job up when appropriate.
Maíra ports several drivers which have been doing that (illegally,
outside our API), and it has worked for them.
Now there's a centralized solution; one which is only active for the
drivers which choose to use it. Other drivers aren't affected.
IOW, my main argument still stands: this series doesn't add a new bug,
but improves the overall situation in DRM.
P.
>
> Regards,
> Christian.
>
> >
> > Best Regards,
> > - Maíra
> >
>
* Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
2025-07-14 10:16 ` Philipp Stanner
@ 2025-07-14 11:46 ` Christian König
0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2025-07-14 11:46 UTC (permalink / raw)
To: phasta, Maíra Canal, Matthew Brost, Danilo Krummrich,
Tvrtko Ursulin, Simona Vetter, David Airlie, Melissa Wen,
Lucas Stach, Russell King, Christian Gmeiner, Lucas De Marchi,
Thomas Hellström, Rodrigo Vivi, Boris Brezillon, Rob Herring,
Steven Price, Liviu Dudau
Cc: kernel-dev, dri-devel, etnaviv, intel-xe
On 14.07.25 12:16, Philipp Stanner wrote:
> On Mon, 2025-07-14 at 11:23 +0200, Christian König wrote:
>> On 13.07.25 21:03, Maíra Canal wrote:
>>> Hi Christian,
>>>
>>> On 11/07/25 12:20, Christian König wrote:
>>>> On 11.07.25 15:37, Philipp Stanner wrote:
>>>>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>>>>
>>>>>>
>>>>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>>>>> When the DRM scheduler times out, it's possible that the
>>>>>>> GPU isn't hung;
>>>>>>> instead, a job just took unusually long (longer than the
>>>>>>> timeout) but is
>>>>>>> still running, and there is, thus, no reason to reset the
>>>>>>> hardware. This
>>>>>>> can occur in two scenarios:
>>>>>>>
>>>>>>> 1. The job is taking longer than the timeout, but the
>>>>>>> driver determined
>>>>>>> through a GPU-specific mechanism that the hardware is
>>>>>>> still making
>>>>>>> progress. Hence, the driver would like the scheduler
>>>>>>> to skip the
>>>>>>> timeout and treat the job as still pending from then
>>>>>>> onward. This
>>>>>>> happens in v3d, Etnaviv, and Xe.
>>>>>>> 2. Timeout has fired before the free-job worker.
>>>>>>> Consequently, the
>>>>>>> scheduler calls `sched->ops->timedout_job()` for a
>>>>>>> job that isn't
>>>>>>> timed out.
>>>>>>>
>>>>>>> These two scenarios are problematic because the job was
>>>>>>> removed from the
>>>>>>> `sched->pending_list` before calling `sched->ops-
>>>>>>>> timedout_job()`, which
>>>>>>> means that when the job finishes, it won't be freed by the
>>>>>>> scheduler
>>>>>>> through `sched->ops->free_job()` - leading to a memory leak.
>>>>>>
>>>>>> Yeah, that is unfortunately intentional.
>>>>>>
>>>>>>> To solve these problems, create a new `drm_gpu_sched_stat`,
>>>>>>> called
>>>>>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip
>>>>>>> the reset. The
>>>>>>> new status will indicate that the job must be reinserted
>>>>>>> into
>>>>>>> `sched->pending_list`, and the hardware / driver will still
>>>>>>> complete that
>>>>>>> job.
>>>>>>
>>>>>> Well long story short we have already tried this and the
>>>>>> whole approach doesn't work correctly in all cases. See the
>>>>>> git history around how we used to destroy the jobs.
>>>>>>
>>>>>> The basic problem is that you can always race between timing
>>>>>> out and signaling/destroying the job. This is the
>>>>>> long-lasting job lifetime issue we already discussed more than
>>>>>> once.
>>>>>
>>>>> The scheduler destroys the job, through free_job().
>>>>> I think we have agreed that for now the scheduler is the party
>>>>> responsible for the job lifetime.
>>>>
>>>> That's what I strongly disagree on. The job is just a state bag
>>>> between the submission and scheduling state of a submission.
>>>>
>>>> For the scheduler the control starts when it is pushed into the
>>>> entity and ends when run_job is called.
>>>>
>>>> The real representation of the submission is the scheduler fence
>>>> and that object has a perfectly defined lifetime, state and error
>>>> handling.
>>>>
>>>>>>
>>>>>> If you want to fix this I think the correct approach is to
>>>>>> completely drop tracking jobs in the scheduler at all.
>>>>>
>>>>> I don't see how this series introduces a problem?
>>>>>
>>>>> The fact is that drivers are abusing the API by just firing
>>>>> jobs back
>>>>> into the scheduler's job list. This series legalizes the abuse
>>>>> by
>>>>> providing scheduler functionality for that.
>>>>>
>>>>> IOW, the series improves the situation but does not add a *new*
>>>>> problem. Even less so as drivers aren't forced to use the new
>>>>> status
>>>>> code, but can continue having job completion race with timeout
>>>>> handlers.
>>>>
>>>> Maybe yes, but I'm really not sure about it.
>>>>
>>>> Take a look at the git history on job destruction: we already had
>>>> exactly that approach, removed it, and said that leaking memory is
>>>> at least better than a use-after-free issue.
>>>>
>>>
>>> If the job was removed from the pending list in the beginning of
>>> the
>>> timeout and drm_sched_get_finished_job() fetches jobs from the
>>> pending
>>> list, how can we end up with an use-after-free issue?
>>
>> By adding it back into the list after the timeout handling completed.
>>
>> We had exactly that before we came up with the horrible design to add
>> it back to the pending list in drm_sched_stop() and the free_guilty
>> flag.
>>
>> Could be that your approach now works better, but I'm really not sure
>> about that.
>
> This isn't Maíra's approach; this is an approach that is already
> established practice in DRM. Have you looked at the code? There isn't
> that much magic happening there.
>
> The job gets added back to pending_list. For the scheduler, this is
> like a completely new job, with the timeout handler running again and
> the other work items picking the job up when appropriate.
>
> Maíra ports several drivers which have been (illegally, outside our
> API) doing that and it worked for them.
Oh, good point! I completely missed that.
> Now there's a centralized solution; one which is only active for the
> drivers which choose to use it. Other driver's aren't affected.
>
> IOW, my main argument still stands: this series doesn't add a new bug,
> but improves the overall situation in DRM.
Yeah, if it's just a cleanup then please go ahead with it. It's certainly better to have it centralized in the scheduler than every driver doing its own thing.
Regards,
Christian.
>
>
> P.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Best Regards,
>>> - Maíra
>>>
>>
>
End of thread (newest: 2025-07-14 11:46 UTC). Thread overview: 23+ messages:
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-07-08 13:25 ` [PATCH v5 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET Maíra Canal
2025-07-08 13:25 ` [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-07-09 13:08 ` Philipp Stanner
2025-07-11 13:22 ` Christian König
2025-07-11 13:37 ` Philipp Stanner
2025-07-11 15:20 ` Christian König
2025-07-11 17:23 ` Matthew Brost
2025-07-14 9:10 ` Christian König
2025-07-13 19:03 ` Maíra Canal
2025-07-14 9:23 ` Christian König
2025-07-14 10:16 ` Philipp Stanner
2025-07-14 11:46 ` Christian König
2025-07-11 14:35 ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 3/8] drm/sched: Make timeout KUnit tests faster Maíra Canal
2025-07-08 13:25 ` [PATCH v5 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-07-08 13:25 ` [PATCH v5 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset Maíra Canal
2025-07-08 13:25 ` [PATCH v5 6/8] drm/etnaviv: " Maíra Canal
2025-07-08 13:25 ` [PATCH v5 7/8] drm/xe: " Maíra Canal
2025-07-08 18:35 ` Matthew Brost
2025-07-08 13:25 ` [PATCH v5 8/8] drm/panfrost: " Maíra Canal
2025-07-09 13:14 ` [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Philipp Stanner
2025-07-10 11:27 ` Maíra Canal