* [PATCH] drm/sched: Extend and update documentation
@ 2024-11-15 10:35 Philipp Stanner
2024-11-21 16:05 ` Alice Ryhl
2024-11-26 16:31 ` Tvrtko Ursulin
0 siblings, 2 replies; 12+ messages in thread
From: Philipp Stanner @ 2024-11-15 10:35 UTC (permalink / raw)
To: David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Jonathan Corbet, Luben Tuikov, Matthew Brost,
Danilo Krummrich, Philipp Stanner
Cc: dri-devel, linux-doc, linux-kernel, Christian König
The various objects defined and used by the GPU scheduler are currently
not fully documented. Furthermore, there is no documentation yet
informing drivers about how they should handle timeouts.
Add documentation describing the scheduler's objects and timeout
procedure. Correspondingly, update drm_sched_backend_ops.timedout_job()'s
documentation.
Co-developed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
I shamelessly stole- ahm, borrowed this documentation patch that
Christian had submitted a year ago:
https://lore.kernel.org/dri-devel/20231116141547.206695-1-christian.koenig@amd.com/
I took feedback from last year into account where applicable, but it's
probably a good idea if you all take a close look again.
P.
---
Documentation/gpu/drm-mm.rst | 36 +++++
drivers/gpu/drm/scheduler/sched_main.c | 200 ++++++++++++++++++++++---
include/drm/gpu_scheduler.h | 16 +-
3 files changed, 225 insertions(+), 27 deletions(-)
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index d55751cad67c..95ee95fd987a 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -556,12 +556,48 @@ Overview
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Overview
+Job Object
+----------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Job Object
+
+Entity Object
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Entity Object
+
+Hardware Fence Object
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Hardware Fence Object
+
+Scheduler Fence Object
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler Fence Object
+
+Scheduler and Run Queue Objects
+-------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler and Run Queue Objects
+
Flow Control
------------
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Flow Control
+Error and Timeout handling
+--------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Error and Timeout handling
+
Scheduler Function References
-----------------------------
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e97c6c60bc96..76eb46281985 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,155 @@
/**
* DOC: Overview
*
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
+ * The GPU scheduler is shared infrastructure intended to help drivers manage
+ * command submission to their hardware.
*
- * The organisation of the scheduler is the following:
+ * To do so, it offers a set of scheduling facilities that interact with the
+ * driver through callbacks which the latter can register.
*
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- * the hardware.
+ * In particular, the scheduler takes care of:
+ * - Ordering command submissions
+ * - Signalling DMA fences, e.g., for finished commands
+ * - Taking dependencies between command submissions into account
+ * - Handling timeouts for command submissions
*
- * The jobs in a entity are always scheduled in the order that they were pushed.
+ * All callbacks the driver needs to implement are restricted by DMA-fence
+ * signaling rules to guarantee deadlock-free forward progress. This especially
+ * means that for normal operation no memory can be allocated in a callback.
+ * All memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated as well.
*
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * Memory which is optional to allocate, for example for device core dumping or
+ * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
+ * handling if that allocation fails. GFP_ATOMIC should only be used if
+ * absolutely necessary since dipping into the special atomic reserves is
+ * usually not justified for a GPU driver.
+ *
+ * Note especially the following about the scheduler's historic background,
+ * which led to the double role it plays today:
+ *
+ * In classic setups N entities share one scheduler, and the scheduler decides
+ * which job to pick from which entity and move it to the hardware ring next
+ * (that is: "scheduling").
+ *
+ * Many (especially newer) GPUs, however, can have an almost arbitrary number
+ * of hardware rings and it's a firmware scheduler which actually decides which
+ * job will run next. In such setups, the GPU scheduler is still used (e.g., in
+ * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
+ * merely serves to queue and dequeue jobs and resolve dependencies. In such a
+ * scenario, it is recommended to have one scheduler per entity.
+ */
+
+/**
+ * DOC: Job Object
+ *
+ * The base job object (drm_sched_job) contains submission dependencies in the
+ * form of DMA-fence objects. Drivers can also implement an optional
+ * prepare_job callback which returns additional dependencies as DMA-fence
+ * objects. It's important to note that this callback can't allocate memory or
+ * grab locks under which memory is allocated.
+ *
+ * Drivers should use this as base class for an object which contains the
+ * necessary state to push the command submission to the hardware.
+ *
+ * The lifetime of the job object needs to last at least from submitting it to
+ * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
+ * drm_sched_backend_ops.free_job() and, thereby, has indicated that it does
+ * not need the job anymore. Drivers can of course keep their job object alive
+ * for longer than that, but that's outside of the scope of the scheduler
+ * component.
+ *
+ * Job initialization is split into two stages:
+ * 1. drm_sched_job_init() which serves for basic preparation of a job.
+ * Drivers don't have to be mindful of this function's consequences and
+ * its effects can be reverted through drm_sched_job_cleanup().
+ * 2. drm_sched_job_arm() which irrevocably arms a job for execution. This
+ * activates the job's fence, i.e., it registers the callbacks. Thus,
+ * inevitably, the callbacks will access the job and its memory at some
+ * point in the future. This means that once drm_sched_job_arm() has been
+ * called, the job structure has to be valid until the scheduler has invoked
+ * drm_sched_backend_ops.free_job().
+ *
+ * It's important to note that after arming a job, drivers must follow the
+ * DMA-fence rules and can't easily allocate memory or take locks under which
+ * memory is allocated.
+ */
+
+/**
+ * DOC: Entity Object
+ *
+ * The entity object (drm_sched_entity) is a container for jobs that
+ * should execute sequentially. Drivers should create an entity for each
+ * individual context they maintain for command submissions which can run in
+ * parallel.
+ *
+ * The lifetime of the entity *should not* exceed the lifetime of the
+ * userspace process it was created for and drivers should call the
+ * drm_sched_entity_flush() function from their file_operations.flush()
+ * callback. It is possible that an entity object is not alive anymore
+ * while jobs previously fetched from it are still running on the hardware.
+ *
+ * This is done because all results of a command submission should become
+ * visible externally even after a process exits. This is normal POSIX
+ * behavior for I/O operations.
+ *
+ * The problem with this approach is that GPU submissions contain executable
+ * shaders enabling processes to evade their termination by offloading work to
+ * the GPU. So when a process is terminated with SIGKILL, the entity object
+ * makes sure that jobs are freed without running them while still maintaining
+ * correct sequential order for signaling fences.
+ */
+
+/**
+ * DOC: Hardware Fence Object
+ *
+ * The hardware fence object is a DMA-fence provided by the driver as result of
+ * running jobs. Drivers need to make sure that the normal DMA-fence semantics
+ * are followed for this object. It's important to note that the memory for
+ * this object can *not* be allocated in drm_sched_backend_ops.run_job() since
+ * that would violate the requirements for the DMA-fence implementation. The
+ * scheduler maintains a timeout handler which triggers if this fence doesn't
+ * signal within a configurable amount of time.
+ *
+ * The lifetime of this object follows DMA-fence refcounting rules. The
+ * scheduler takes ownership of the reference returned by the driver and
+ * drops it when it's not needed any more.
+ */
+
+/**
+ * DOC: Scheduler Fence Object
+ *
+ * The scheduler fence object (drm_sched_fence) encapsulates the whole
+ * time from pushing the job into the scheduler until the hardware has finished
+ * processing it. This is internally managed by the scheduler, but drivers can
+ * grab an additional reference to it after arming a job. The implementation
+ * provides DMA-fence interfaces for signaling both scheduling of a command
+ * submission as well as finishing of processing.
+ *
+ * The lifetime of this object also follows normal DMA-fence refcounting rules.
+ * The finished fence is the one normally exposed to the outside world, but the
+ * driver can grab references to both the scheduled as well as the finished
+ * fence when needed for pipelining optimizations.
+ */
+
+/**
+ * DOC: Scheduler and Run Queue Objects
+ *
+ * The scheduler object itself (drm_gpu_scheduler) does the actual work of
+ * selecting a job and pushing it to the hardware. Both FIFO and RR selection
+ * algorithms are supported, but FIFO is preferred for many use cases.
+ *
+ * The lifetime of the scheduler is managed by the driver using it. Before
+ * destroying the scheduler the driver must ensure that all hardware processing
+ * involving this scheduler object has finished by calling for example
+ * disable_irq(). It is *not* sufficient to wait for the hardware fence here
+ * since this doesn't guarantee that all callback processing has finished.
+ *
+ * The run queue object (drm_sched_rq) is a container for entities of a certain
+ * priority level. This object is internally managed by the scheduler and
+ * drivers shouldn't touch it directly. The lifetime of a run queue is bound to
+ * the scheduler's lifetime.
*/
/**
@@ -72,6 +199,43 @@
* limit.
*/
+/**
+ * DOC: Error and Timeout handling
+ *
+ * Errors should be signaled by using dma_fence_set_error() on the hardware
+ * fence object before signaling it. Errors are then bubbled up from the
+ * hardware fence to the scheduler fence.
+ *
+ * The entity allows querying errors on the last run submission using the
+ * drm_sched_entity_error() function which can be used to cancel queued
+ * submissions in drm_sched_backend_ops.run_job() as well as preventing
+ * pushing further ones into the entity in the driver's submission function.
+ *
+ * When the hardware fence doesn't signal within a configurable amount of time
+ * drm_sched_backend_ops.timedout_job() gets invoked. The driver should then
+ * follow the procedure described in that callback's documentation.
+ * (TODO: The timeout handler should probably switch to using the hardware
+ * fence as parameter instead of the job. Otherwise the handling will always
+ * race between timing out and signaling the fence).
+ *
+ * The scheduler also used to provide functionality for re-submitting jobs
+ * and, thereby, replacing the hardware fence during reset handling. This
+ * functionality is now marked as deprecated. This has proven to be
+ * fundamentally racy and not compatible with DMA-fence rules and shouldn't be
+ * used in new code.
+ *
+ * Additionally, there is the function drm_sched_increase_karma() which tries
+ * to find the entity which submitted a job and increases its 'karma' atomic
+ * variable to prevent resubmitting jobs from this entity. This has considerable
+ * overhead, and resubmitting jobs is now marked as deprecated. Thus, using this
+ * function is discouraged.
+ *
+ * Drivers can still recreate the GPU state in case it is lost during
+ * timeout handling *if* they can guarantee that forward progress will be made
+ * and this doesn't cause another timeout. But this is strongly hardware
+ * specific and out of the scope of the general GPU scheduler.
+ */
+
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/completion.h>
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9c437a057e5d..c52363453861 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -417,8 +417,8 @@ struct drm_sched_backend_ops {
struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
/**
- * @timedout_job: Called when a job has taken too long to execute,
- * to trigger GPU recovery.
+ * @timedout_job: Called when a hardware fence didn't signal within a
+ * configurable amount of time. Triggers GPU recovery.
*
* This method is called in a workqueue context.
*
@@ -429,9 +429,8 @@ struct drm_sched_backend_ops {
* scheduler thread and cancel the timeout work, guaranteeing that
* nothing is queued while we reset the hardware queue
* 2. Try to gracefully stop non-faulty jobs (optional)
- * 3. Issue a GPU reset (driver-specific)
- * 4. Re-submit jobs using drm_sched_resubmit_jobs()
- * 5. Restart the scheduler using drm_sched_start(). At that point, new
+ * 3. Issue a GPU or context reset (driver-specific)
+ * 4. Restart the scheduler using drm_sched_start(). At that point, new
* jobs can be queued, and the scheduler thread is unblocked
*
* Note that some GPUs have distinct hardware queues but need to reset
@@ -447,16 +446,15 @@ struct drm_sched_backend_ops {
* 2. Try to gracefully stop non-faulty jobs on all queues impacted by
* the reset (optional)
* 3. Issue a GPU reset on all faulty queues (driver-specific)
- * 4. Re-submit jobs on all schedulers impacted by the reset using
- * drm_sched_resubmit_jobs()
- * 5. Restart all schedulers that were stopped in step #1 using
+ * 4. Restart all schedulers that were stopped in step #1 using
* drm_sched_start()
*
* Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
* and the underlying driver has started or completed recovery.
*
* Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
- * available, i.e. has been unplugged.
+ * available, for example if it has been unplugged or failed to
+ * recover.
*/
enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
--
2.47.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] drm/sched: Extend and update documentation
2024-11-15 10:35 Philipp Stanner
@ 2024-11-21 16:05 ` Alice Ryhl
2024-12-05 13:20 ` Philipp Stanner
2024-11-26 16:31 ` Tvrtko Ursulin
1 sibling, 1 reply; 12+ messages in thread
From: Alice Ryhl @ 2024-11-21 16:05 UTC (permalink / raw)
To: Philipp Stanner
Cc: David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Jonathan Corbet, Luben Tuikov, Matthew Brost,
Danilo Krummrich, dri-devel, linux-doc, linux-kernel,
Christian König
On Fri, Nov 15, 2024 at 11:36 AM Philipp Stanner <pstanner@redhat.com> wrote:
>
> The various objects defined and used by the GPU scheduler are currently
> not fully documented. Furthermore, there is no documentation yet
> informing drivers about how they should handle timeouts.
>
> Add documentation describing the scheduler's objects and timeout
> procedure. Consistently, update drm_sched_backend_ops.timedout_job()'s
> documentation.
>
> Co-developed-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> I shamelessly stole- ahm, borrowed this documentation patch that
> Christian had submitted a year ago:
>
> https://lore.kernel.org/dri-devel/20231116141547.206695-1-christian.koenig@amd.com/
>
> I took feedback from last year into account where applicable, but it's
> probably a good idea if you all take a close look again.
>
> P.
> ---
> Documentation/gpu/drm-mm.rst | 36 +++++
> drivers/gpu/drm/scheduler/sched_main.c | 200 ++++++++++++++++++++++---
> include/drm/gpu_scheduler.h | 16 +-
> 3 files changed, 225 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index d55751cad67c..95ee95fd987a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -556,12 +556,48 @@ Overview
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler and Run Queue Objects
> +
> Flow Control
> ------------
>
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Flow Control
>
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Error and Timeout handling
> +
> Scheduler Function References
> -----------------------------
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index e97c6c60bc96..76eb46281985 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,155 @@
> /**
> * DOC: Overview
> *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is shared infrastructure intended to help drivers managing
> + * command submission to their hardware.
> *
> - * The organisation of the scheduler is the following:
> + * To do so, it offers a set of scheduling facilities that interact with the
> + * driver through callbacks which the latter can register.
> *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - * the hardware.
> + * In particular, the scheduler takes care of:
> + * - Ordering command submissions
> + * - Signalling DMA fences, e.g., for finished commands
> + * - Taking dependencies between command submissions into account
> + * - Handling timeouts for command submissions
For the signalling case, you say "e.g.". Does that mean it also
signals DMA fences in other cases?
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated as well.
> *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Memory which is optional to allocate, for example for device core dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
> + * handling if that allocation fails. GFP_ATOMIC should only be used if
> + * absolutely necessary since dipping into the special atomic reserves is
> + * usually not justified for a GPU driver.
> + *
> + * Note especially the following about the scheduler's historic background that
> + * lead to sort of a double role it plays today:
> + *
> + * In classic setups N entities share one scheduler, and the scheduler decides
> + * which job to pick from which entity and move it to the hardware ring next
> + * (that is: "scheduling").
> + *
> + * Many (especially newer) GPUs, however, can have an almost arbitrary number
> + * of hardware rings and it's a firmware scheduler which actually decides which
> + * job will run next. In such setups, the GPU scheduler is still used (e.g., in
> + * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
> + * merely serves to queue and dequeue jobs and resolve dependencies. In such a
> + * scenario, it is recommended to have one scheduler per entity.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object (drm_sched_job) contains submission dependencies in the
> + * form of DMA-fence objects. Drivers can also implement an optional
> + * prepare_job callback which returns additional dependencies as DMA-fence
> + * objects. It's important to note that this callback can't allocate memory or
> + * grab locks under which memory is allocated.
> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object needs to last at least from submitting it to
> + * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
> + * drm_sched_backend_ops.free_job() and, thereby, has indicated that it does
> + * not need the job anymore. Drivers can of course keep their job object alive
> + * for longer than that, but that's outside of the scope of the scheduler
> + * component.
> + *
> + * Job initialization is split into two stages:
> + * 1. drm_sched_job_init() which serves for basic preparation of a job.
> + * Drivers don't have to be mindful of this function's consequences and
> + * its effects can be reverted through drm_sched_job_cleanup().
> + * 2. drm_sched_job_arm() which irrevokably arms a job for execution. This
> + * activates the job's fence, i.e., it registers the callbacks. Thus,
> + * inevitably, the callbacks will access the job and its memory at some
> + * point in the future. This means that once drm_sched_job_arm() has been
> + * called, the job structure has to be valid until the scheduler invoked
> + * drm_sched_backend_ops.free_job().
This is written as if there could be multiple callbacks in a single
job. Is that the case?
Also typo: "invoked" -> "invokes".
> + * It's important to note that after arming a job drivers must follow the
> + * DMA-fence rules and can't easily allocate memory or takes locks under which
> + * memory is allocated.
comma? "job, drivers"
typo: "or takes" -> "or take"
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object (drm_sched_entity) which is a container for jobs which
> + * should execute sequentially. Drivers should create an entity for each
> + * individual context they maintain for command submissions which can run in
> + * parallel.
This is a bit awkward, how about: "The entity object is a container
for jobs that should execute sequentially."
> + * The lifetime of the entity *should not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush()
> + * callback. It is possible that an entity object is not alive anymore
> + * while jobs previously fetched from it are still running on the hardware.
To be clear ... this is about not letting processes run code after
dying, and not because something you're using gets freed after
flush(), correct?
> + * This is done because all results of a command submission should become
> + * visible externally even after a process exits. This is normal POSIX
> + * behavior for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver as result of
> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
> + * are followed for this object. It's important to note that the memory for
> + * this object can *not* be allocated in drm_sched_backend_ops.run_job() since
> + * that would violate the requirements for the DMA-fence implementation. The
> + * scheduler maintains a timeout handler which triggers if this fence doesn't
> + * signal within a configurable amount of time.
> + *
> + * The lifetime of this object follows DMA-fence refcounting rules. The
> + * scheduler takes ownership of the reference returned by the driver and
> + * drops it when it's not needed any more.
> + */
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object (drm_sched_fence) which encapsulates the whole
> + * time from pushing the job into the scheduler until the hardware has finished
> + * processing it. This is internally managed by the scheduler, but drivers can
> + * grab additional reference to it after arming a job. The implementation
> + * provides DMA-fence interfaces for signaling both scheduling of a command
> + * submission as well as finishing of processing.
typo: "an additional reference" or "additional references"
> + * The lifetime of this object also follows normal DMA-fence refcounting rules.
> + * The finished fence is the one normally exposed to the outside world, but the
> + * driver can grab references to both the scheduled as well as the finished
> + * fence when needed for pipelining optimizations.
When you refer to the "scheduled fence" and the "finished fence",
these are referring to "a fence indicating when the job was scheduled
/ finished", rather than "a fence which was scheduled for execution
and has now become finished", correct? I think the wording could be a
bit clearer here.
Alice
* Re: [PATCH] drm/sched: Extend and update documentation
2024-11-15 10:35 Philipp Stanner
2024-11-21 16:05 ` Alice Ryhl
@ 2024-11-26 16:31 ` Tvrtko Ursulin
1 sibling, 0 replies; 12+ messages in thread
From: Tvrtko Ursulin @ 2024-11-26 16:31 UTC (permalink / raw)
To: Philipp Stanner, David Airlie, Simona Vetter, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Jonathan Corbet, Luben Tuikov,
Matthew Brost, Danilo Krummrich
Cc: dri-devel, linux-doc, linux-kernel, Christian König
On 15/11/2024 10:35, Philipp Stanner wrote:
> The various objects defined and used by the GPU scheduler are currently
> not fully documented. Furthermore, there is no documentation yet
> informing drivers about how they should handle timeouts.
>
> Add documentation describing the scheduler's objects and timeout
> procedure. Consistently, update drm_sched_backend_ops.timedout_job()'s
> documentation.
>
> Co-developed-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> I shamelessly stole- ahm, borrowed this documentation patch that
> Christian had submitted a year ago:
>
> https://lore.kernel.org/dri-devel/20231116141547.206695-1-christian.koenig@amd.com/
>
> I took feedback from last year into account where applicable, but it's
> probably a good idea if you all take a close look again.
>
> P.
> ---
> Documentation/gpu/drm-mm.rst | 36 +++++
> drivers/gpu/drm/scheduler/sched_main.c | 200 ++++++++++++++++++++++---
> include/drm/gpu_scheduler.h | 16 +-
> 3 files changed, 225 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index d55751cad67c..95ee95fd987a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -556,12 +556,48 @@ Overview
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler and Run Queue Objects
> +
> Flow Control
> ------------
>
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Flow Control
>
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Error and Timeout handling
> +
> Scheduler Function References
> -----------------------------
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index e97c6c60bc96..76eb46281985 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,155 @@
> /**
> * DOC: Overview
> *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is shared infrastructure intended to help drivers manage
> + * command submission to their hardware.
> *
> - * The organisation of the scheduler is the following:
> + * To do so, it offers a set of scheduling facilities that interact with the
> + * driver through callbacks which the latter can register.
> *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - * the hardware.
> + * In particular, the scheduler takes care of:
> + * - Ordering command submissions
> + * - Signalling DMA fences, e.g., for finished commands
> + * - Taking dependencies between command submissions into account
> + * - Handling timeouts for command submissions
> *
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock-free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated.
> *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Memory which is optional to allocate, for example for device core dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
> + * handling if that allocation fails. GFP_ATOMIC should only be used if
> + * absolutely necessary since dipping into the special atomic reserves is
> + * usually not justified for a GPU driver.
> + *
> + * Note especially the following about the scheduler's historic background that
> + * led to sort of a double role it plays today:
> + *
> + * In classic setups N entities share one scheduler, and the scheduler decides
> + * which job to pick from which entity and move it to the hardware ring next
> + * (that is: "scheduling").
> + *
> + * Many (especially newer) GPUs, however, can have an almost arbitrary number
> + * of hardware rings and it's a firmware scheduler which actually decides which
> + * job will run next. In such setups, the GPU scheduler is still used (e.g., in
> + * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
> + * merely serves to queue and dequeue jobs and resolve dependencies. In such a
> + * scenario, it is recommended to have one scheduler per entity.
> + */
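[Not part of the patch, just to illustrate the allocation rule above with a
sketch; names are made up:

	/* Optional allocation in a scheduler callback: must not sleep. */
	buf = kzalloc(buf_size, GFP_NOWAIT);
	if (!buf)
		return;	/* debug data is optional, simply skip it */

i.e. allocation failure has to be handled gracefully instead of falling back
to GFP_ATOMIC.]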
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object (drm_sched_job) contains submission dependencies in the
> + * form of DMA-fence objects. Drivers can also implement an optional
> + * prepare_job callback which returns additional dependencies as DMA-fence
> + * objects. It's important to note that this callback can't allocate memory or
> + * grab locks under which memory is allocated.
AFAICT amdgpu_prepare_job can allocate memory. Maybe that is okay
because scheduler workqueue is nowadays marked as WQ_MEM_RECLAIM but in
general it would be more logical if it wasn't allocating. It does not
seem an easy job to make it stop doing so. Perhaps Christian has
something planned here?
> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object needs to last at least from submitting it to
> + * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
> + * drm_sched_backend_ops.free_job() and, thereby, has indicated that it does
> + * not need the job anymore. Drivers can of course keep their job object alive
> + * for longer than that, but that's outside of the scope of the scheduler
> + * component.
> + *
> + * Job initialization is split into two stages:
> + * 1. drm_sched_job_init() which serves for basic preparation of a job.
> + * Drivers don't have to be mindful of this function's consequences and
> + * its effects can be reverted through drm_sched_job_cleanup().
> + * 2. drm_sched_job_arm() which irrevokably arms a job for execution. This
irrevocably
(Btw I did not do a full read, just came here for prepare_job
clarifications.)
Regards,
Tvrtko
> + * activates the job's fence, i.e., it registers the callbacks. Thus,
> + * inevitably, the callbacks will access the job and its memory at some
> + * point in the future. This means that once drm_sched_job_arm() has been
> + * called, the job structure has to be valid until the scheduler invoked
> + * drm_sched_backend_ops.free_job().
> + *
> + * It's important to note that after arming a job drivers must follow the
> + * DMA-fence rules and can't easily allocate memory or takes locks under which
> + * memory is allocated.
> + */
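[For the record, the two-stage initialization described above typically looks
roughly like this in a driver. This is only a sketch: driver_prepare_submission()
is a made-up helper, and the drm_sched_job_init() signature has varied between
kernel versions:

	ret = drm_sched_job_init(&job->base, entity, 1, owner);
	if (ret)
		return ret;

	/* Allocate everything needed to push the job to the HW *now*. */
	ret = driver_prepare_submission(job);
	if (ret) {
		drm_sched_job_cleanup(&job->base);	/* reverts _init() */
		return ret;
	}

	drm_sched_job_arm(&job->base);		/* point of no return */
	drm_sched_entity_push_job(&job->base);

After drm_sched_job_arm() the job must stay valid until free_job().]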
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object (drm_sched_entity) which is a container for jobs which
> + * should execute sequentially. Drivers should create an entity for each
> + * individual context they maintain for command submissions which can run in
> + * parallel.
> + *
> + * The lifetime of the entity *should not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush()
> + * callback. It is possible that an entity object is not alive anymore
> + * while jobs previously fetched from it are still running on the hardware.
> + *
> + * This is done because all results of a command submission should become
> + * visible externally even after a process exits. This is normal POSIX
> + * behavior for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + */
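[Sketch of what that means in practice; struct driver_file_priv and the
timeout value are made up:

	static int driver_flush(struct file *file, fl_owner_t id)
	{
		struct driver_file_priv *priv = file->private_data;

		drm_sched_entity_flush(&priv->entity, msecs_to_jiffies(1000));
		return 0;
	}

with driver_flush() plugged into file_operations.flush().]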
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver as a result
> + * of running jobs. Drivers need to make sure that the normal DMA-fence semantics
> + * are followed for this object. It's important to note that the memory for
> + * this object can *not* be allocated in drm_sched_backend_ops.run_job() since
> + * that would violate the requirements for the DMA-fence implementation. The
> + * scheduler maintains a timeout handler which triggers if this fence doesn't
> + * signal within a configurable amount of time.
> + *
> + * The lifetime of this object follows DMA-fence refcounting rules. The
> + * scheduler takes ownership of the reference returned by the driver and
> + * drops it when it's not needed any more.
> + */
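[As an illustration of "can *not* be allocated in run_job()" (sketch only, the
driver_* helpers are hypothetical):

	/* At job creation time, i.e. before drm_sched_job_arm(): */
	job->hw_fence = driver_alloc_hw_fence(job);

	/* Later, in the run_job() callback: no allocation, just return it. */
	static struct dma_fence *driver_run_job(struct drm_sched_job *sched_job)
	{
		struct driver_job *job = to_driver_job(sched_job);

		driver_ring_submit(job);
		return dma_fence_get(&job->hw_fence->base);
	}
]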
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object (drm_sched_fence) encapsulates the whole
> + * time from pushing the job into the scheduler until the hardware has finished
> + * processing it. This is internally managed by the scheduler, but drivers can
> + * grab additional reference to it after arming a job. The implementation
> + * provides DMA-fence interfaces for signaling both scheduling of a command
> + * submission as well as finishing of processing.
> + *
> + * The lifetime of this object also follows normal DMA-fence refcounting rules.
> + * The finished fence is the one normally exposed to the outside world, but the
> + * driver can grab references to both the scheduled as well as the finished
> + * fence when needed for pipelining optimizations.
> + */
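[For reference, grabbing those fences after arming looks like this (sketch):

	drm_sched_job_arm(&job->base);
	/* &job->base.s_fence->scheduled signals when the job leaves the
	 * entity, &job->base.s_fence->finished when processing is done. */
	fence = dma_fence_get(&job->base.s_fence->finished);

with normal dma_fence_put() semantics once the reference isn't needed.]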
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself (drm_gpu_scheduler) does the actual work of
> + * selecting a job and pushing it to the hardware. Both FIFO and RR selection
> + * algorithms are supported, but FIFO is preferred for many use cases.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by calling for example
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.
> + *
> + * The run queue object (drm_sched_rq) is a container for entities of a certain
> + * priority level. This object is internally managed by the scheduler and
> + * drivers shouldn't touch it directly. The lifetime of a run queue is bound to
> + * the scheduler's lifetime.
> */
>
> /**
> @@ -72,6 +199,43 @@
> * limit.
> */
>
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors should be signaled by using dma_fence_set_error() on the hardware
> + * fence object before signaling it. Errors are then bubbled up from the
> + * hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in drm_sched_backend_ops.run_job() as well as preventing
> + * pushing further ones into the entity in the driver's submission function.
> + *
> + * When the hardware fence doesn't signal within a configurable amount of time
> + * drm_sched_backend_ops.timedout_job() gets invoked. The driver should then
> + * follow the procedure described in that callback's documentation.
> + * (TODO: The timeout handler should probably switch to using the hardware
> + * fence as parameter instead of the job. Otherwise the handling will always
> + * race between timing out and signaling the fence).
> + *
> + * The scheduler also used to provide functionality for re-submitting jobs
> + * and, thereby, replacing the hardware fence during reset handling. This
> + * functionality is now marked as deprecated. This has proven to be
> + * fundamentally racy and not compatible with DMA-fence rules and shouldn't be
> + * used in new code.
> + *
> + * Additionally, there is the function drm_sched_increase_karma() which tries
> + * to find the entity which submitted a job and increases its 'karma' atomic
> + * variable to prevent resubmitting jobs from this entity. This has quite some
> + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> + * function is discouraged.
> + *
> + * Drivers can still recreate the GPU state in case it should be lost during
> + * timeout handling *if* they can guarantee that forward progress will be made
> + * and this doesn't cause another timeout. But this is strongly hardware
> + * specific and out of the scope of the general GPU scheduler.
> + */
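[A sketch of the error-propagation pattern described here (illustrative only,
job->hw_fence is a made-up driver member):

	/* Driver side, when the hardware reports a fault for a job: */
	dma_fence_set_error(&job->hw_fence->base, -EIO);
	dma_fence_signal(&job->hw_fence->base);

	/* Submission side, refusing further work on a broken context: */
	ret = drm_sched_entity_error(entity);
	if (ret)
		return ret;
]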
> +
> #include <linux/wait.h>
> #include <linux/sched.h>
> #include <linux/completion.h>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 9c437a057e5d..c52363453861 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -417,8 +417,8 @@ struct drm_sched_backend_ops {
> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
> /**
> - * @timedout_job: Called when a job has taken too long to execute,
> - * to trigger GPU recovery.
> + * @timedout_job: Called when a hardware fence didn't signal within a
> + * configurable amount of time. Triggers GPU recovery.
> *
> * This method is called in a workqueue context.
> *
> @@ -429,9 +429,8 @@ struct drm_sched_backend_ops {
> * scheduler thread and cancel the timeout work, guaranteeing that
> * nothing is queued while we reset the hardware queue
> * 2. Try to gracefully stop non-faulty jobs (optional)
> - * 3. Issue a GPU reset (driver-specific)
> - * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> - * 5. Restart the scheduler using drm_sched_start(). At that point, new
> + * 3. Issue a GPU or context reset (driver-specific)
> + * 4. Restart the scheduler using drm_sched_start(). At that point, new
> * jobs can be queued, and the scheduler thread is unblocked
> *
> * Note that some GPUs have distinct hardware queues but need to reset
> @@ -447,16 +446,15 @@ struct drm_sched_backend_ops {
> * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
> * the reset (optional)
> * 3. Issue a GPU reset on all faulty queues (driver-specific)
> - * 4. Re-submit jobs on all schedulers impacted by the reset using
> - * drm_sched_resubmit_jobs()
> - * 5. Restart all schedulers that were stopped in step #1 using
> + * 4. Restart all schedulers that were stopped in step #1 using
> * drm_sched_start()
> *
> * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
> * and the underlying driver has started or completed recovery.
> *
> * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> - * available, i.e. has been unplugged.
> + * available, for example if it has been unplugged or failed to
> + * recover.
> */
> enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
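FWIW, a minimal timeout handler following the single-queue procedure above
could be sketched as follows (the driver_* helpers are made up, and the
drm_sched_start() signature has changed across kernel versions):

	static enum drm_gpu_sched_stat
	driver_timedout_job(struct drm_sched_job *sched_job)
	{
		struct drm_gpu_scheduler *sched = sched_job->sched;

		drm_sched_stop(sched, sched_job);	/* step 1 */
		driver_stop_nonfaulty_jobs(sched);	/* step 2, optional */
		driver_reset_hw_queue(sched);		/* step 3 */
		drm_sched_start(sched, 0);		/* step 4 */

		return DRM_GPU_SCHED_STAT_NOMINAL;
	}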
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] drm/sched: Extend and update documentation
2024-11-21 16:05 ` Alice Ryhl
@ 2024-12-05 13:20 ` Philipp Stanner
0 siblings, 0 replies; 12+ messages in thread
From: Philipp Stanner @ 2024-12-05 13:20 UTC (permalink / raw)
To: Alice Ryhl
Cc: David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Jonathan Corbet, Luben Tuikov, Matthew Brost,
Danilo Krummrich, dri-devel, linux-doc, linux-kernel,
Christian König
On Thu, 2024-11-21 at 17:05 +0100, Alice Ryhl wrote:
> On Fri, Nov 15, 2024 at 11:36 AM Philipp Stanner
> <pstanner@redhat.com> wrote:
> >
> > The various objects defined and used by the GPU scheduler are
> > currently
> > not fully documented. Furthermore, there is no documentation yet
> > informing drivers about how they should handle timeouts.
> >
> > Add documentation describing the scheduler's objects and timeout
> > procedure. Consistently, update
> > drm_sched_backend_ops.timedout_job()'s
> > documentation.
> >
> > Co-developed-by: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> > ---
> > I shamelessly stole- ahm, borrowed this documentation patch that
> > Christian had submitted a year ago:
> >
> > https://lore.kernel.org/dri-devel/20231116141547.206695-1-christian.koenig@amd.com/
> >
> > I took feedback from last year into account where applicable, but
> > it's
> > probably a good idea if you all take a close look again.
> >
> > P.
> > ---
> > Documentation/gpu/drm-mm.rst | 36 +++++
> > drivers/gpu/drm/scheduler/sched_main.c | 200
> > ++++++++++++++++++++++---
> > include/drm/gpu_scheduler.h | 16 +-
> > 3 files changed, 225 insertions(+), 27 deletions(-)
> >
> > diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-
> > mm.rst
> > index d55751cad67c..95ee95fd987a 100644
> > --- a/Documentation/gpu/drm-mm.rst
> > +++ b/Documentation/gpu/drm-mm.rst
> > @@ -556,12 +556,48 @@ Overview
> > .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > :doc: Overview
> >
> > +Job Object
> > +----------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > + :doc: Job Object
> > +
> > +Entity Object
> > +-------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > + :doc: Entity Object
> > +
> > +Hardware Fence Object
> > +---------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > + :doc: Hardware Fence Object
> > +
> > +Scheduler Fence Object
> > +----------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > + :doc: Scheduler Fence Object
> > +
> > +Scheduler and Run Queue Objects
> > +-------------------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > + :doc: Scheduler and Run Queue Objects
> > +
> > Flow Control
> > ------------
> >
> > .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > :doc: Flow Control
> >
> > +Error and Timeout handling
> > +--------------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > + :doc: Error and Timeout handling
> > +
> > Scheduler Function References
> > -----------------------------
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index e97c6c60bc96..76eb46281985 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -24,28 +24,155 @@
> > /**
> > * DOC: Overview
> > *
> > - * The GPU scheduler provides entities which allow userspace to
> > push jobs
> > - * into software queues which are then scheduled on a hardware run
> > queue.
> > - * The software queues have a priority among them. The scheduler
> > selects the entities
> > - * from the run queue using a FIFO. The scheduler provides
> > dependency handling
> > - * features among jobs. The driver is supposed to provide callback
> > functions for
> > - * backend operations to the scheduler like submitting a job to
> > hardware run queue,
> > - * returning the dependencies of a job etc.
> > + * The GPU scheduler is shared infrastructure intended to help
> > drivers managing
> > + * command submission to their hardware.
> > *
> > - * The organisation of the scheduler is the following:
> > + * To do so, it offers a set of scheduling facilities that
> > interact with the
> > + * driver through callbacks which the latter can register.
> > *
> > - * 1. Each hw run queue has one scheduler
> > - * 2. Each scheduler has multiple run queues with different
> > priorities
> > - * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> > - * 3. Each scheduler run queue has a queue of entities to schedule
> > - * 4. Entities themselves maintain a queue of jobs that will be
> > scheduled on
> > - * the hardware.
> > + * In particular, the scheduler takes care of:
> > + * - Ordering command submissions
> > + * - Signalling DMA fences, e.g., for finished commands
> > + * - Taking dependencies between command submissions into
> > account
> > + * - Handling timeouts for command submissions
>
> For the signalling case, you say "e.g.". Does that mean it also
> signals DMA fences in other cases?
Good question – it does signal another fence when a job is being
scheduled, but that's not really relevant for the scheduler user IMO.
On the other hand, the docu further down does refer to that fence.
I think we could mention them both here.
>
> > - * The jobs in a entity are always scheduled in the order that
> > they were pushed.
> > + * All callbacks the driver needs to implement are restricted by
> > DMA-fence
> > + * signaling rules to guarantee deadlock free forward progress.
> > This especially
> > + * means that for normal operation no memory can be allocated in a
> > callback.
> > + * All memory which is needed for pushing the job to the hardware
> > must be
> > + * allocated before arming a job. It also means that no locks can
> > be taken
> > + * under which memory might be allocated as well.
> > *
> > - * Note that once a job was taken from the entities queue and
> > pushed to the
> > - * hardware, i.e. the pending queue, the entity must not be
> > referenced anymore
> > - * through the jobs entity pointer.
> > + * Memory which is optional to allocate, for example for device
> > core dumping or
> > + * debugging, *must* be allocated with GFP_NOWAIT and appropriate
> > error
> > + * handling if that allocation fails. GFP_ATOMIC should only be
> > used if
> > + * absolutely necessary since dipping into the special atomic
> > reserves is
> > + * usually not justified for a GPU driver.
> > + *
> > + * Note especially the following about the scheduler's historic
> > background that
> > + * lead to sort of a double role it plays today:
> > + *
> > + * In classic setups N entities share one scheduler, and the
> > scheduler decides
> > + * which job to pick from which entity and move it to the hardware
> > ring next
> > + * (that is: "scheduling").
> > + *
> > + * Many (especially newer) GPUs, however, can have an almost
> > arbitrary number
> > + * of hardware rings and it's a firmware scheduler which actually
> > decides which
> > + * job will run next. In such setups, the GPU scheduler is still
> > used (e.g., in
> > + * Nouveau) but does not "schedule" jobs in the classical sense
> > anymore. It
> > + * merely serves to queue and dequeue jobs and resolve
> > dependencies. In such a
> > + * scenario, it is recommended to have one scheduler per entity.
> > + */
> > +
> > +/**
> > + * DOC: Job Object
> > + *
> > + * The base job object (drm_sched_job) contains submission
> > dependencies in the
> > + * form of DMA-fence objects. Drivers can also implement an
> > optional
> > + * prepare_job callback which returns additional dependencies as
> > DMA-fence
> > + * objects. It's important to note that this callback can't
> > allocate memory or
> > + * grab locks under which memory is allocated.
> > + *
> > + * Drivers should use this as base class for an object which
> > contains the
> > + * necessary state to push the command submission to the hardware.
> > + *
> > + * The lifetime of the job object needs to last at least from
> > submitting it to
> > + * the scheduler (through drm_sched_job_arm()) until the scheduler
> > has invoked
> > + * drm_sched_backend_ops.free_job() and, thereby, has indicated
> > that it does
> > + * not need the job anymore. Drivers can of course keep their job
> > object alive
> > + * for longer than that, but that's outside of the scope of the
> > scheduler
> > + * component.
> > + *
> > + * Job initialization is split into two stages:
> > + * 1. drm_sched_job_init() which serves for basic preparation of
> > a job.
> > + * Drivers don't have to be mindful of this function's
> > consequences and
> > + * its effects can be reverted through
> > drm_sched_job_cleanup().
> > + * 2. drm_sched_job_arm() which irrevokably arms a job for
> > execution. This
> > + * activates the job's fence, i.e., it registers the
> > callbacks. Thus,
> > + * inevitably, the callbacks will access the job and its
> > memory at some
> > + * point in the future. This means that once
> > drm_sched_job_arm() has been
> > + * called, the job structure has to be valid until the
> > scheduler invoked
> > + * drm_sched_backend_ops.free_job().
>
> This is written as-if there could be multiple callbacks in a single
> job. Is that the case?
>
No, arm() just arms the software fence that will fire after the
scheduler is done with the job and calls free_job().
Will adjust.
>
>
>
> Also typo: "invoked" -> "invokes".
Past tense actually was done on purpose, but I think present tense is
better indeed, because the validity ends once the function starts
executing.
>
> > + * It's important to note that after arming a job drivers must
> > follow the
> > + * DMA-fence rules and can't easily allocate memory or takes locks
> > under which
> > + * memory is allocated.
>
> comma? "job, drivers"
> typo: "or takes" -> "or take"
ACK
>
> > +
> > +/**
> > + * DOC: Entity Object
> > + *
> > + * The entity object (drm_sched_entity) which is a container for
> > jobs which
> > + * should execute sequentially. Drivers should create an entity
> > for each
> > + * individual context they maintain for command submissions which
> > can run in
> > + * parallel.
>
> This is a bit awkward, how about: "The entity object is a container
> for jobs that should execute sequentially."
ACK
>
> > + * The lifetime of the entity *should not* exceed the lifetime of
> > the
> > + * userspace process it was created for and drivers should call
> > the
> > + * drm_sched_entity_flush() function from their
> > file_operations.flush()
> > + * callback. It is possible that an entity object is not alive
> > anymore
> > + * while jobs previously fetched from it are still running on the
> > hardware.
>
> To be clear ... this is about not letting processes run code after
> dying, and not because something you're using gets freed after
> flush(), correct?
Not sure if I can fully follow, but this section is not really about
preventing processes from anything. It's just that an entity is created
as a job-container for a specific process, so it doesn't make sense
that the entity lives longer than the process.
If the driver wouldn't ensure that it would simply mean that the
scheduler keeps pulling jobs from that entity, but since the userspace
process is gone by then it wouldn't really matter. I'm not aware of
this really being harmful, as long as the entity gets cleaned up at
some point.
>
> > + * This is done because all results of a command submission should
> > become
> > + * visible externally even after a process exits. This is normal
> > POSIX
> > + * behavior for I/O operations.
> > + *
> > + * The problem with this approach is that GPU submissions contain
> > executable
> > + * shaders enabling processes to evade their termination by
> > offloading work to
> > + * the GPU. So when a process is terminated with a SIGKILL the
> > entity object
> > + * makes sure that jobs are freed without running them while still
> > maintaining
> > + * correct sequential order for signaling fences.
> > + */
> > +
> > +/**
> > + * DOC: Hardware Fence Object
> > + *
> > + * The hardware fence object is a DMA-fence provided by the driver
> > as result of
> > + * running jobs. Drivers need to make sure that the normal DMA-
> > fence semantics
> > + * are followed for this object. It's important to note that the
> > memory for
> > + * this object can *not* be allocated in
> > drm_sched_backend_ops.run_job() since
> > + * that would violate the requirements for the DMA-fence
> > implementation. The
> > + * scheduler maintains a timeout handler which triggers if this
> > fence doesn't
> > + * signal within a configurable amount of time.
> > + *
> > + * The lifetime of this object follows DMA-fence refcounting
> > rules. The
> > + * scheduler takes ownership of the reference returned by the
> > driver and
> > + * drops it when it's not needed any more.
> > + */
> > +
> > +/**
> > + * DOC: Scheduler Fence Object
> > + *
> > + * The scheduler fence object (drm_sched_fence) which encapsulates
> > the whole
> > + * time from pushing the job into the scheduler until the hardware
> > has finished
> > + * processing it. This is internally managed by the scheduler, but
> > drivers can
> > + * grab additional reference to it after arming a job. The
> > implementation
> > + * provides DMA-fence interfaces for signaling both scheduling of
> > a command
> > + * submission as well as finishing of processing.
>
> typo: "an additional reference" or "additional references"
>
> > + * The lifetime of this object also follows normal DMA-fence
> > refcounting rules.
> > + * The finished fence is the one normally exposed to the outside
> > world, but the
> > + * driver can grab references to both the scheduled as well as the
> > finished
> > + * fence when needed for pipelining optimizations.
>
> When you refer to the "scheduled fence" and the "finished fence",
> these are referring to "a fence indicating when the job was scheduled
> / finished", rather than "a fence which was scheduled for execution
> and has now become finished", correct? I think the wording could be a
> bit clearer here.
If anything it would have to be "A fence signaling once a job has been
scheduled".
I'll try to come up with a better wording
Thx,
P.
>
> Alice
>
* [PATCH] drm/sched: Extend and update documentation
@ 2025-07-24 14:01 Philipp Stanner
2025-07-24 15:07 ` Philipp Stanner
2025-08-04 15:41 ` Christian König
0 siblings, 2 replies; 12+ messages in thread
From: Philipp Stanner @ 2025-07-24 14:01 UTC (permalink / raw)
To: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Jonathan Corbet, Matthew Brost, Danilo Krummrich,
Philipp Stanner, Christian König, Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media, Philipp Stanner,
Christian König
From: Philipp Stanner <pstanner@redhat.com>
The various objects and their memory lifetime used by the GPU scheduler
are currently not fully documented.
Add documentation describing the scheduler's objects. Improve the
general documentation at a few other places.
Co-developed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
The first draft for this docu was posted by Christian in late 2023 IIRC.
This is an updated version. Please review.
@Christian: As we agreed on months (a year?) ago I kept your Signed-off
by. Just tell me if there's any issue or sth.
---
Documentation/gpu/drm-mm.rst | 36 ++++
drivers/gpu/drm/scheduler/sched_main.c | 228 ++++++++++++++++++++++---
include/drm/gpu_scheduler.h | 5 +-
3 files changed, 238 insertions(+), 31 deletions(-)
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index d55751cad67c..95ee95fd987a 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -556,12 +556,48 @@ Overview
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Overview
+Job Object
+----------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Job Object
+
+Entity Object
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Entity Object
+
+Hardware Fence Object
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Hardware Fence Object
+
+Scheduler Fence Object
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler Fence Object
+
+Scheduler and Run Queue Objects
+-------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Scheduler and Run Queue Objects
+
Flow Control
------------
.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Flow Control
+Error and Timeout handling
+--------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+ :doc: Error and Timeout handling
+
Scheduler Function References
-----------------------------
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5a550fd76bf0..2e7bc1e74186 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,48 +24,220 @@
/**
* DOC: Overview
*
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
+ * The GPU scheduler is shared infrastructure intended to help drivers manage
+ * command submission to their hardware.
*
- * The organisation of the scheduler is the following:
+ * To do so, it offers a set of scheduling facilities that interact with the
+ * driver through callbacks which the latter can register.
*
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- * the hardware.
+ * In particular, the scheduler takes care of:
+ * - Ordering command submissions
+ * - Signalling dma_fences, e.g., for finished commands
+ * - Taking dependencies between command submissions into account
+ * - Handling timeouts for command submissions
*
- * The jobs in an entity are always scheduled in the order in which they were pushed.
+ * All callbacks the driver needs to implement are restricted by dma_fence
+ * signaling rules to guarantee deadlock-free forward progress. This especially
+ * means that for normal operation no memory can be allocated in a callback.
+ * All memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated.
*
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * Optional memory, for example for device core dumping or debugging, *must* be
+ * allocated with GFP_NOWAIT and appropriate error handling if that allocation
+ * fails. GFP_ATOMIC should only be used if absolutely necessary since dipping
+ * into the special atomic reserves is usually not justified for a GPU driver.
+ *
+ * Note especially the following about the scheduler's historical background,
+ * which led to the double role it plays today:
+ *
+ * In classic ("hardware scheduling") setups, N entities share one scheduler,
+ * and the scheduler decides which job to pick from which entity and move it to
+ * the hardware ring next (that is: "scheduling").
+ *
+ * Many (especially newer) GPUs, however, can have an almost arbitrary number
+ * of hardware rings and it's a firmware scheduler which actually decides which
+ * job will run next. In such setups, the GPU scheduler is still used (e.g., in
+ * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
+ * merely serves to queue and dequeue jobs and resolve dependencies. In such a
+ * scenario, it is recommended to have one scheduler per entity.
+ */
+
+/**
+ * DOC: Job Object
+ *
+ * The base job object (&struct drm_sched_job) contains submission dependencies
+ * in the form of &struct dma_fence objects. Drivers can also implement an
+ * optional prepare_job callback which returns additional dependencies as
+ * dma_fence objects. It's important to note that this callback can't allocate
+ * memory or grab locks under which memory is allocated.
+ *
+ * Drivers should use this as the base class for an object which contains the
+ * necessary state to push the command submission to the hardware.
+ *
+ * The lifetime of the job object needs to last at least from submitting it to
+ * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
+ * &struct drm_sched_backend_ops.free_job and, thereby, has indicated that it
+ * does not need the job anymore. Drivers can of course keep their job object
+ * alive for longer than that, but that's outside of the scope of the scheduler
+ * component.
+ *
+ * Job initialization is split into two stages:
+ * 1. drm_sched_job_init() which serves for basic preparation of a job.
+ * Drivers don't have to be mindful of this function's consequences and
+ * its effects can be reverted through drm_sched_job_cleanup().
+ * 2. drm_sched_job_arm() which irrevocably arms a job for execution. This
+ * initializes the job's fences and the job has to be submitted with
+ * drm_sched_entity_push_job(). Once drm_sched_job_arm() has been called,
+ * the job structure has to be valid until the scheduler has invoked
+ * drm_sched_backend_ops.free_job().
+ *
+ * It's important to note that after arming a job, drivers must follow the
+ * dma_fence rules and can't easily allocate memory or take locks under which
+ * memory is allocated.
+ */
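Sketched as kernel-style pseudocode, a driver's submit path following this two-stage split might look roughly like this (struct driver_job and driver_prepare_hw_state() are hypothetical, and drm_sched_job_init()'s exact signature differs between kernel versions):

```c
/* Hypothetical driver job wrapping the scheduler's base job object. */
static int driver_submit_job(struct driver_job *djob,
			     struct drm_sched_entity *entity)
{
	int ret;

	/* Stage 1: basic, revertible preparation. */
	ret = drm_sched_job_init(&djob->base, entity, 1, NULL);
	if (ret)
		return ret;

	/* Allocate everything needed for pushing the job to the hardware
	 * here, before arming; afterwards the dma_fence rules forbid it. */
	ret = driver_prepare_hw_state(djob);
	if (ret) {
		drm_sched_job_cleanup(&djob->base); /* reverts stage 1 */
		return ret;
	}

	/* Stage 2: irrevocable. The job's fences are initialized and the
	 * job structure must stay valid until free_job() has been invoked. */
	drm_sched_job_arm(&djob->base);
	drm_sched_entity_push_job(&djob->base);

	return 0;
}
```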
+
+/**
+ * DOC: Entity Object
+ *
+ * The entity object (&struct drm_sched_entity) is a container for jobs which
+ * should execute sequentially. Drivers should create an entity for each
+ * individual context they maintain for command submissions which can run in
+ * parallel.
+ *
+ * The lifetime of the entity *should not* exceed the lifetime of the
+ * userspace process it was created for and drivers should call the
+ * drm_sched_entity_flush() function from their file_operations.flush()
+ * callback. It is possible that an entity object is not alive anymore
+ * while jobs previously fetched from it are still running on the hardware.
+ *
+ * This is done because all results of a command submission should become
+ * visible externally even after a process exits. This is normal POSIX
+ * behavior for I/O operations.
+ *
+ * The problem with this approach is that GPU submissions contain executable
+ * shaders enabling processes to evade their termination by offloading work to
+ * the GPU. So when a process is terminated with SIGKILL, the entity object
+ * makes sure that jobs are freed without running them, while still
+ * maintaining the correct sequential order for signaling fences.
+ *
+ * All entities associated with a scheduler have to be torn down before that
+ * scheduler.
+ */
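A minimal sketch of the recommended file_operations.flush() wiring (kernel-style pseudocode; struct driver_file and the timeout value are made up):

```c
static int driver_flush(struct file *f, fl_owner_t id)
{
	struct driver_file *df = f->private_data; /* hypothetical per-file state */

	/* Waits for the entity's queued jobs to be fetched; if the process
	 * was killed with SIGKILL, queued jobs are freed without running. */
	drm_sched_entity_flush(&df->entity, msecs_to_jiffies(500));

	return 0;
}
```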
+
+/**
+ * DOC: Hardware Fence Object
+ *
+ * The hardware fence object is a dma_fence provided by the driver through
+ * &struct drm_sched_backend_ops.run_job. The driver signals this fence once the
+ * hardware has completed the associated job.
+ *
+ * Drivers need to make sure that the normal dma_fence semantics are followed
+ * for this object. It's important to note that the memory for this object can
+ * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
+ * violate the requirements for the dma_fence implementation. The scheduler
+ * maintains a timeout handler which triggers if this fence doesn't signal
+ * within a configurable amount of time.
+ *
+ * The lifetime of this object follows dma_fence refcounting rules. The
+ * scheduler takes ownership of the reference returned by the driver and
+ * drops it when it's not needed any more.
+ *
+ * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
+ */
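The refcounting rule can be illustrated with a rough run_job sketch (kernel-style pseudocode; to_driver_job(), driver_ring_submit() and the hw_fence member are hypothetical):

```c
static struct dma_fence *driver_run_job(struct drm_sched_job *sched_job)
{
	struct driver_job *djob = to_driver_job(sched_job);

	/* The hardware fence was allocated before the job was armed;
	 * run_job itself must not allocate memory. */
	driver_ring_submit(djob);

	/* The scheduler takes ownership of the reference returned here
	 * and drops it once the fence is no longer needed. */
	return dma_fence_get(djob->hw_fence);
}
```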
+
+/**
+ * DOC: Scheduler Fence Object
+ *
+ * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
+ * time from pushing the job into the scheduler until the hardware has finished
+ * processing it. It is managed by the scheduler. The implementation provides
+ * dma_fence interfaces for signaling both the scheduling of a command
+ * submission and the finishing of processing.
+ *
+ * The lifetime of this object also follows normal dma_fence refcounting rules.
+ */
+
+/**
+ * DOC: Scheduler and Run Queue Objects
+ *
+ * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
+ * scheduling: it picks the next entity to run a job from and pushes that job
+ * onto the hardware. Both FIFO and RR selection algorithms are supported, with
+ * FIFO being the default and the recommended one.
+ *
+ * The lifetime of the scheduler is managed by the driver using it. Before
+ * destroying the scheduler, the driver must ensure that all hardware
+ * processing involving this scheduler object has finished, for example by
+ * calling disable_irq(). It is *not* sufficient to wait for the hardware
+ * fence here since this doesn't guarantee that all callback processing has
+ * finished.
+ *
+ * The run queue object (&struct drm_sched_rq) is a container for entities of a
+ * certain priority level. This object is internally managed by the scheduler
+ * and drivers must not touch it directly. The lifetime of a run queue is bound
+ * to the scheduler's lifetime.
+ *
+ * All entities associated with a scheduler must be torn down before it. Drivers
+ * should implement &struct drm_sched_backend_ops.cancel_job to prevent pending
+ * jobs (those that were pulled from an entity into the scheduler, but have not
+ * been completed by the hardware yet) from leaking.
*/
/**
* DOC: Flow Control
*
* The DRM GPU scheduler provides a flow control mechanism to regulate the rate
- * in which the jobs fetched from scheduler entities are executed.
+ * at which jobs fetched from scheduler entities are executed.
*
- * In this context the &drm_gpu_scheduler keeps track of a driver specified
- * credit limit representing the capacity of this scheduler and a credit count;
- * every &drm_sched_job carries a driver specified number of credits.
+ * In this context the &struct drm_gpu_scheduler keeps track of a driver
+ * specified credit limit representing the capacity of this scheduler and a
+ * credit count; every &struct drm_sched_job carries a driver-specified number
+ * of credits.
*
- * Once a job is executed (but not yet finished), the job's credits contribute
- * to the scheduler's credit count until the job is finished. If by executing
- * one more job the scheduler's credit count would exceed the scheduler's
- * credit limit, the job won't be executed. Instead, the scheduler will wait
- * until the credit count has decreased enough to not overflow its credit limit.
- * This implies waiting for previously executed jobs.
+ * Once a job is being executed, the job's credits contribute to the
+ * scheduler's credit count until the job is finished. If by executing one more
+ * job the scheduler's credit count would exceed the scheduler's credit limit,
+ * the job won't be executed. Instead, the scheduler will wait until the credit
+ * count has decreased enough to not overflow its credit limit. This implies
+ * waiting for previously executed jobs.
*/
+/**
+ * DOC: Error and Timeout handling
+ *
+ * Errors are signaled by using dma_fence_set_error() on the hardware fence
+ * object before signaling it with dma_fence_signal(). Errors are then bubbled
+ * up from the hardware fence to the scheduler fence.
+ *
+ * The entity allows querying errors on the last run submission using the
+ * drm_sched_entity_error() function which can be used to cancel queued
+ * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
+ * pushing further ones into the entity in the driver's submission function.
+ *
+ * When the hardware fence doesn't signal within a configurable amount of time
+ * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
+ * then follow the procedure described in that callback's documentation.
+ *
+ * (TODO: The timeout handler should probably switch to using the hardware
+ * fence as parameter instead of the job. Otherwise the handling will always
+ * race between timing out and signaling the fence).
+ *
+ * The scheduler also used to provide functionality for re-submitting jobs
+ * and, thereby, replacing the hardware fence during reset handling. This
+ * functionality is now deprecated: it has proven to be fundamentally racy
+ * and incompatible with the dma_fence rules, and shouldn't be used in new
+ * code.
+ *
+ * Additionally, there is the function drm_sched_increase_karma() which tries
+ * to find the entity which submitted a job and increases its 'karma' atomic
+ * variable to prevent resubmitting jobs from this entity. This has quite some
+ * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
+ * function is discouraged.
+ *
+ * Drivers can still recreate the GPU state in case it should be lost during
+ * timeout handling *if* they can guarantee that forward progress will be made
+ * and this doesn't cause another timeout. But this is strongly hardware
+ * specific and out of the scope of the general GPU scheduler.
+ */
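The error path described here can be sketched as follows (kernel-style pseudocode; driver_job_done() and the hw_fence member are hypothetical):

```c
/* Hypothetical completion-interrupt path: record a hardware error on the
 * hardware fence before signaling it, so that the error bubbles up to the
 * scheduler fence and becomes visible via drm_sched_entity_error(). */
static void driver_job_done(struct driver_job *djob, int hw_error)
{
	if (hw_error)
		dma_fence_set_error(djob->hw_fence, hw_error);

	dma_fence_signal(djob->hw_fence);
}
```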
#include <linux/export.h>
#include <linux/wait.h>
#include <linux/sched.h>
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 323a505e6e6a..0f0687b7ae9c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
/**
- * @timedout_job: Called when a job has taken too long to execute,
- * to trigger GPU recovery.
+ * @timedout_job: Called when a hardware fence didn't signal within a
+ * configurable amount of time. Triggers GPU recovery.
*
* @sched_job: The job that has timed out
*
@@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
* that timeout handlers are executed sequentially.
*
* Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
- *
*/
enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] drm/sched: Extend and update documentation
2025-07-24 14:01 [PATCH] drm/sched: Extend and update documentation Philipp Stanner
@ 2025-07-24 15:07 ` Philipp Stanner
2025-08-05 9:05 ` Christian König
2025-08-04 15:41 ` Christian König
1 sibling, 1 reply; 12+ messages in thread
From: Philipp Stanner @ 2025-07-24 15:07 UTC (permalink / raw)
To: Philipp Stanner, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
Matthew Brost, Danilo Krummrich, Christian König,
Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media,
Christian König
Two comments from myself to open up room for discussion:
On Thu, 2025-07-24 at 16:01 +0200, Philipp Stanner wrote:
> From: Philipp Stanner <pstanner@redhat.com>
>
> The various objects and their memory lifetime used by the GPU scheduler
> are currently not fully documented.
>
> Add documentation describing the scheduler's objects. Improve the
> general documentation at a few other places.
>
> Co-developed-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> The first draft for this docu was posted by Christian in late 2023 IIRC.
>
> This is an updated version. Please review.
>
> @Christian: As we agreed on months (a year?) ago I kept your Signed-off
> by. Just tell me if there's any issue or sth.
> ---
> Documentation/gpu/drm-mm.rst | 36 ++++
> drivers/gpu/drm/scheduler/sched_main.c | 228 ++++++++++++++++++++++---
> include/drm/gpu_scheduler.h | 5 +-
> 3 files changed, 238 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index d55751cad67c..95ee95fd987a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -556,12 +556,48 @@ Overview
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler and Run Queue Objects
> +
> Flow Control
> ------------
>
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Flow Control
>
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Error and Timeout handling
> +
> Scheduler Function References
> -----------------------------
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a550fd76bf0..2e7bc1e74186 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,48 +24,220 @@
> /**
> * DOC: Overview
> *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is shared infrastructure intended to help drivers manage
> + * command submission to their hardware.
> *
> - * The organisation of the scheduler is the following:
> + * To do so, it offers a set of scheduling facilities that interact with the
> + * driver through callbacks which the latter can register.
> *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - * the hardware.
> + * In particular, the scheduler takes care of:
> + * - Ordering command submissions
> + * - Signalling dma_fences, e.g., for finished commands
> + * - Taking dependencies between command submissions into account
> + * - Handling timeouts for command submissions
> *
> - * The jobs in an entity are always scheduled in the order in which they were pushed.
> + * All callbacks the driver needs to implement are restricted by dma_fence
> + * signaling rules to guarantee deadlock free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated.
> *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Optional memory, for example for device core dumping or debugging, *must* be
> + * allocated with GFP_NOWAIT and appropriate error handling if that allocation
> + * fails. GFP_ATOMIC should only be used if absolutely necessary since dipping
> + * into the special atomic reserves is usually not justified for a GPU driver.
> + *
> + * Note especially the following about the scheduler's historical background,
> + * which led to the double role it plays today:
> + *
> + * In classic setups N ("hardware scheduling") entities share one scheduler,
> + * and the scheduler decides which job to pick from which entity and move it to
> + * the hardware ring next (that is: "scheduling").
> + *
> + * Many (especially newer) GPUs, however, can have an almost arbitrary number
> + * of hardware rings and it's a firmware scheduler which actually decides which
> + * job will run next. In such setups, the GPU scheduler is still used (e.g., in
> + * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
> + * merely serves to queue and dequeue jobs and resolve dependencies. In such a
> + * scenario, it is recommended to have one scheduler per entity.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object (&struct drm_sched_job) contains submission dependencies
> + * in the form of &struct dma_fence objects. Drivers can also implement an
> + * optional prepare_job callback which returns additional dependencies as
> + * dma_fence objects. It's important to note that this callback can't allocate
> + * memory or grab locks under which memory is allocated.
> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object needs to last at least from submitting it to
> + * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
> + * &struct drm_sched_backend_ops.free_job and, thereby, has indicated that it
> + * does not need the job anymore. Drivers can of course keep their job object
> + * alive for longer than that, but that's outside of the scope of the scheduler
> + * component.
> + *
> + * Job initialization is split into two stages:
> + * 1. drm_sched_job_init() which serves for basic preparation of a job.
> + * Drivers don't have to be mindful of this function's consequences and
> + * its effects can be reverted through drm_sched_job_cleanup().
> + * 2. drm_sched_job_arm() which irrevocably arms a job for execution. This
> + * initializes the job's fences and the job has to be submitted with
> + * drm_sched_entity_push_job(). Once drm_sched_job_arm() has been called,
> + * the job structure has to be valid until the scheduler has invoked
> + * drm_sched_backend_ops.free_job().
> + *
> + * It's important to note that after arming a job, drivers must follow the
> + * dma_fence rules and can't easily allocate memory or take locks under which
> + * memory is allocated.
> + */
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object (&struct drm_sched_entity) is a container for jobs which
> + * should execute sequentially. Drivers should create an entity for each
> + * individual context they maintain for command submissions which can run in
> + * parallel.
> + *
> + * The lifetime of the entity *should not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush()
> + * callback. It is possible that an entity object is not alive anymore
> + * while jobs previously fetched from it are still running on the hardware.
> + *
> + * This is done because all results of a command submission should become
> + * visible externally even after a process exits. This is normal POSIX
> + * behavior for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + *
> + * All entities associated with a scheduler have to be torn down before that
> + * scheduler.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a dma_fence provided by the driver through
> + * &struct drm_sched_backend_ops.run_job. The driver signals this fence once the
> + * hardware has completed the associated job.
> + *
> + * Drivers need to make sure that the normal dma_fence semantics are followed
> + * for this object. It's important to note that the memory for this object can
> + * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
> + * violate the requirements for the dma_fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal
> + * within a configurable amount of time.
> + *
> + * The lifetime of this object follows dma_fence refcounting rules. The
> + * scheduler takes ownership of the reference returned by the driver and
> + * drops it when it's not needed any more.
> + *
> + * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
> + */
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> + * time from pushing the job into the scheduler until the hardware has finished
> + * processing it. It is managed by the scheduler. The implementation provides
> + * dma_fence interfaces for signaling both scheduling of a command submission
> + * as well as finishing of processing.
> + *
> + * The lifetime of this object also follows normal dma_fence refcounting rules.
> + */
The relict I'm most unsure about is this docu for the scheduler fence.
I know that some drivers are accessing the s_fence, but I strongly
suspect that this is a) unnecessary and b) dangerous.
But the original draft from Christian hinted at that. So, @Christian,
this would be an opportunity to discuss this matter.
Otherwise I'd drop this docu section in v2. What users don't know, they
cannot misuse.
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
> + * scheduling: it picks the next entity to run a job from and pushes that job
> + * onto the hardware. Both FIFO and RR selection algorithms are supported, with
> + * FIFO being the default and the recommended one.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by calling for example
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.
> + *
> + * The run queue object (&struct drm_sched_rq) is a container for entities of a
> + * certain priority level. This object is internally managed by the scheduler
> + * and drivers must not touch it directly. The lifetime of a run queue is bound
> + * to the scheduler's lifetime.
> + *
> + * All entities associated with a scheduler must be torn down before it. Drivers
> + * should implement &struct drm_sched_backend_ops.cancel_job to prevent pending
> + * jobs (those that were pulled from an entity into the scheduler, but have not
> + * been completed by the hardware yet) from leaking.
> */
>
> /**
> * DOC: Flow Control
> *
> * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> - * in which the jobs fetched from scheduler entities are executed.
> + * at which jobs fetched from scheduler entities are executed.
> *
> - * In this context the &drm_gpu_scheduler keeps track of a driver specified
> - * credit limit representing the capacity of this scheduler and a credit count;
> - * every &drm_sched_job carries a driver specified number of credits.
> + * In this context the &struct drm_gpu_scheduler keeps track of a driver
> + * specified credit limit representing the capacity of this scheduler and a
> + * credit count; every &struct drm_sched_job carries a driver-specified number
> + * of credits.
> *
> - * Once a job is executed (but not yet finished), the job's credits contribute
> - * to the scheduler's credit count until the job is finished. If by executing
> - * one more job the scheduler's credit count would exceed the scheduler's
> - * credit limit, the job won't be executed. Instead, the scheduler will wait
> - * until the credit count has decreased enough to not overflow its credit limit.
> - * This implies waiting for previously executed jobs.
> + * Once a job is being executed, the job's credits contribute to the
> + * scheduler's credit count until the job is finished. If by executing one more
> + * job the scheduler's credit count would exceed the scheduler's credit limit,
> + * the job won't be executed. Instead, the scheduler will wait until the credit
> + * count has decreased enough to not overflow its credit limit. This implies
> + * waiting for previously executed jobs.
> */
>
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> + * up from the hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
> + * pushing further ones into the entity in the driver's submission function.
> + *
> + * When the hardware fence doesn't signal within a configurable amount of time
> + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> + * then follow the procedure described in that callback's documentation.
> + *
> + * (TODO: The timeout handler should probably switch to using the hardware
> + * fence as parameter instead of the job. Otherwise the handling will always
> + * race between timing out and signaling the fence).
This TODO can probably be removed, too. The recently merged
DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.
P.
> + *
> + * The scheduler also used to provide functionality for re-submitting jobs
> + * and, thereby, replacing the hardware fence during reset handling. This
> + * functionality is now deprecated: it has proven to be fundamentally racy
> + * and incompatible with the dma_fence rules, and shouldn't be used in new code.
> + *
> + * Additionally, there is the function drm_sched_increase_karma() which tries
> + * to find the entity which submitted a job and increases its 'karma' atomic
> + * variable to prevent resubmitting jobs from this entity. This has quite some
> + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> + * function is discouraged.
> + *
> + * Drivers can still recreate the GPU state in case it should be lost during
> + * timeout handling *if* they can guarantee that forward progress will be made
> + * and this doesn't cause another timeout. But this is strongly hardware
> + * specific and out of the scope of the general GPU scheduler.
> + */
> #include <linux/export.h>
> #include <linux/wait.h>
> #include <linux/sched.h>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 323a505e6e6a..0f0687b7ae9c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
> /**
> - * @timedout_job: Called when a job has taken too long to execute,
> - * to trigger GPU recovery.
> + * @timedout_job: Called when a hardware fence didn't signal within a
> + * configurable amount of time. Triggers GPU recovery.
> *
> * @sched_job: The job that has timed out
> *
> @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
> * that timeout handlers are executed sequentially.
> *
> * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> - *
> */
> enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] drm/sched: Extend and update documentation
2025-07-24 14:01 [PATCH] drm/sched: Extend and update documentation Philipp Stanner
2025-07-24 15:07 ` Philipp Stanner
@ 2025-08-04 15:41 ` Christian König
1 sibling, 0 replies; 12+ messages in thread
From: Christian König @ 2025-08-04 15:41 UTC (permalink / raw)
To: Philipp Stanner, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Simona Vetter, Jonathan Corbet,
Matthew Brost, Danilo Krummrich, Christian König,
Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media, Philipp Stanner
On 24.07.25 16:01, Philipp Stanner wrote:
> From: Philipp Stanner <pstanner@redhat.com>
>
> The various objects and their memory lifetime used by the GPU scheduler
> are currently not fully documented.
>
> Add documentation describing the scheduler's objects. Improve the
> general documentation at a few other places.
>
> Co-developed-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> The first draft for this docu was posted by Christian in late 2023 IIRC.
>
> This is an updated version. Please review.
>
> @Christian: As we agreed on months (a year?) ago I kept your Signed-off
> by. Just tell me if there's any issue or sth.
Sure go ahead. And thanks a lot for taking the work to look into this.
Regards,
Christian.
> ---
> Documentation/gpu/drm-mm.rst | 36 ++++
> drivers/gpu/drm/scheduler/sched_main.c | 228 ++++++++++++++++++++++---
> include/drm/gpu_scheduler.h | 5 +-
> 3 files changed, 238 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index d55751cad67c..95ee95fd987a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -556,12 +556,48 @@ Overview
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Overview
>
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Scheduler and Run Queue Objects
> +
> Flow Control
> ------------
>
> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> :doc: Flow Control
>
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> + :doc: Error and Timeout handling
> +
> Scheduler Function References
> -----------------------------
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a550fd76bf0..2e7bc1e74186 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,48 +24,220 @@
> /**
> * DOC: Overview
> *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is shared infrastructure intended to help drivers manage
> + * command submission to their hardware.
> *
> - * The organisation of the scheduler is the following:
> + * To do so, it offers a set of scheduling facilities that interact with the
> + * driver through callbacks which the latter can register.
> *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - * the hardware.
> + * In particular, the scheduler takes care of:
> + * - Ordering command submissions
> + * - Signalling dma_fences, e.g., for finished commands
> + * - Taking dependencies between command submissions into account
> + * - Handling timeouts for command submissions
> *
> - * The jobs in an entity are always scheduled in the order in which they were pushed.
> + * All callbacks the driver needs to implement are restricted by dma_fence
> + * signaling rules to guarantee deadlock free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated.
> *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Optional memory, for example for device core dumping or debugging, *must* be
> + * allocated with GFP_NOWAIT and appropriate error handling if that allocation
> + * fails. GFP_ATOMIC should only be used if absolutely necessary since dipping
> + * into the special atomic reserves is usually not justified for a GPU driver.
> + *
> + * Note especially the following about the scheduler's historic background,
> + * which led to the sort of double role it plays today:
> + *
> + * In classic ("hardware scheduling") setups, N entities share one scheduler,
> + * and the scheduler decides which job to pick from which entity and to move
> + * to the hardware ring next (that is: "scheduling").
> + *
> + * Many (especially newer) GPUs, however, can have an almost arbitrary number
> + * of hardware rings and it's a firmware scheduler which actually decides which
> + * job will run next. In such setups, the GPU scheduler is still used (e.g., in
> + * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
> + * merely serves to queue and dequeue jobs and resolve dependencies. In such a
> + * scenario, it is recommended to have one scheduler per entity.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object (&struct drm_sched_job) contains submission dependencies
> + * in the form of &struct dma_fence objects. Drivers can also implement an
> + * optional prepare_job callback which returns additional dependencies as
> + * dma_fence objects. It's important to note that this callback can't allocate
> + * memory or grab locks under which memory is allocated.
> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object needs to last at least from submitting it to
> + * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
> + * &struct drm_sched_backend_ops.free_job and, thereby, has indicated that it
> + * does not need the job anymore. Drivers can of course keep their job object
> + * alive for longer than that, but that's outside of the scope of the scheduler
> + * component.
> + *
> + * Job initialization is split into two stages:
> + * 1. drm_sched_job_init() which serves for basic preparation of a job.
> + * Drivers don't have to be mindful of this function's consequences and
> + * its effects can be reverted through drm_sched_job_cleanup().
> + * 2. drm_sched_job_arm() which irrevocably arms a job for execution. This
> + * initializes the job's fences and the job has to be submitted with
> + * drm_sched_entity_push_job(). Once drm_sched_job_arm() has been called,
> + * the job structure has to be valid until the scheduler has invoked
> + * drm_sched_backend_ops.free_job().
> + *
> + * It's important to note that after arming a job, drivers must follow the
> + * dma_fence rules and cannot allocate memory or take locks under which
> + * memory is allocated.
> + */
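The two-stage initialization described above typically looks like the following in a driver's submission path. This is an illustrative sketch only: my_job, my_ring_reserve_space() and the error handling are hypothetical driver details, not scheduler API, and the exact drm_sched_job_init() arguments vary between kernel versions.

```c
static int my_driver_submit(struct my_job *job, struct drm_sched_entity *entity)
{
	int ret;

	/* Stage 1: basic preparation; still fully revertible. */
	ret = drm_sched_job_init(&job->base, entity, 1, NULL);
	if (ret)
		return ret;

	/* Allocate everything needed to push the job to the hardware
	 * *before* arming; afterwards the dma_fence rules forbid it. */
	ret = my_ring_reserve_space(job);
	if (ret) {
		drm_sched_job_cleanup(&job->base); /* reverts job_init() */
		return ret;
	}

	/* Stage 2: irrevocable from here on. */
	drm_sched_job_arm(&job->base);
	drm_sched_entity_push_job(&job->base);
	return 0;
}
```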
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object (&struct drm_sched_entity) is a container for jobs which
> + * should execute sequentially. Drivers should create an entity for each
> + * individual context they maintain for command submissions which can run in
> + * parallel.
> + *
> + * The lifetime of the entity *should not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush()
> + * callback. It is possible that an entity object is not alive anymore
> + * while jobs previously fetched from it are still running on the hardware.
> + *
> + * This is done because all results of a command submission should become
> + * visible externally even after a process exits. This is normal POSIX
> + * behavior for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + *
> + * All entities associated with a scheduler have to be torn down before that
> + * scheduler.
> + */
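A minimal sketch of how a driver ties entity lifetime to the process, under the assumption of a hypothetical my_context structure; only drm_sched_entity_flush() and drm_sched_entity_fini() are real scheduler API:

```c
static int my_driver_flush(struct file *f, fl_owner_t id)
{
	struct my_context *ctx = to_my_context(f->private_data);

	/* Wait (up to a timeout) for queued jobs; on SIGKILL the entity
	 * frees remaining jobs without running them, while keeping the
	 * fence signaling order intact. */
	drm_sched_entity_flush(&ctx->entity, msecs_to_jiffies(500));
	return 0;
}

static void my_context_destroy(struct my_context *ctx)
{
	/* Entities must be torn down before their scheduler. */
	drm_sched_entity_fini(&ctx->entity);
}
```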
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a dma_fence provided by the driver through
> + * &struct drm_sched_backend_ops.run_job. The driver signals this fence once the
> + * hardware has completed the associated job.
> + *
> + * Drivers need to make sure that the normal dma_fence semantics are followed
> + * for this object. It's important to note that the memory for this object can
> + * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
> + * violate the requirements for the dma_fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal
> + * within a configurable amount of time.
> + *
> + * The lifetime of this object follows dma_fence refcounting rules. The
> + * scheduler takes ownership of the reference returned by the driver and
> + * drops it when it's not needed any more.
> + *
> + * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
> + */
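A hedged sketch of a run_job() implementation respecting these rules; my_job, my_ring_emit() and the hw_fence member are hypothetical driver details. The point is that the dma_fence backing the hardware fence was allocated earlier, at job creation time:

```c
static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = to_my_job(sched_job);

	/* Push the commands to the hardware; no allocation happens here. */
	my_ring_emit(job);

	/* The scheduler takes ownership of the reference returned here
	 * and drops it once the fence is no longer needed. */
	return dma_fence_get(&job->hw_fence->base);
}
```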
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> + * time from pushing the job into the scheduler until the hardware has finished
> + * processing it. It is managed by the scheduler. The implementation provides
> + * dma_fence interfaces for signaling both scheduling of a command submission
> + * as well as finishing of processing.
> + *
> + * The lifetime of this object also follows normal dma_fence refcounting rules.
> + */
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
> + * scheduling: it picks the next entity to run a job from and pushes that job
> + * onto the hardware. Both FIFO and RR selection algorithms are supported, with
> + * FIFO being the default and the recommended one.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by, for example, calling
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.
> + *
> + * The run queue object (&struct drm_sched_rq) is a container for entities of a
> + * certain priority level. This object is internally managed by the scheduler
> + * and drivers must not touch it directly. The lifetime of a run queue is bound
> + * to the scheduler's lifetime.
> + *
> + * All entities associated with a scheduler must be torn down before it. Drivers
> + * should implement &struct drm_sched_backend_ops.cancel_job to prevent pending
> + * jobs (those that were pulled from an entity into the scheduler, but have not
> + * been completed by the hardware yet) from leaking.
> */
>
> /**
> * DOC: Flow Control
> *
> * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> - * in which the jobs fetched from scheduler entities are executed.
> + * at which jobs fetched from scheduler entities are executed.
> *
> - * In this context the &drm_gpu_scheduler keeps track of a driver specified
> - * credit limit representing the capacity of this scheduler and a credit count;
> - * every &drm_sched_job carries a driver specified number of credits.
> + * In this context the &struct drm_gpu_scheduler keeps track of a driver
> + * specified credit limit representing the capacity of this scheduler and a
> + * credit count; every &struct drm_sched_job carries a driver-specified number
> + * of credits.
> *
> - * Once a job is executed (but not yet finished), the job's credits contribute
> - * to the scheduler's credit count until the job is finished. If by executing
> - * one more job the scheduler's credit count would exceed the scheduler's
> - * credit limit, the job won't be executed. Instead, the scheduler will wait
> - * until the credit count has decreased enough to not overflow its credit limit.
> - * This implies waiting for previously executed jobs.
> + * Once a job is being executed, the job's credits contribute to the
> + * scheduler's credit count until the job is finished. If by executing one more
> + * job the scheduler's credit count would exceed the scheduler's credit limit,
> + * the job won't be executed. Instead, the scheduler will wait until the credit
> + * count has decreased enough to not overflow its credit limit. This implies
> + * waiting for previously executed jobs.
> */
>
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> + * up from the hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
> + * pushing further ones into the entity in the driver's submission function.
> + *
> + * When the hardware fence doesn't signal within a configurable amount of time
> + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> + * then follow the procedure described in that callback's documentation.
> + *
> + * (TODO: The timeout handler should probably switch to using the hardware
> + * fence as parameter instead of the job. Otherwise the handling will always
> + * race between timing out and signaling the fence).
> + *
> + * The scheduler also used to provide functionality for re-submitting jobs,
> + * thereby replacing the hardware fence during reset handling. This
> + * functionality is now deprecated; it has proven to be fundamentally racy,
> + * is not compatible with the dma_fence rules, and shouldn't be used in new
> + * code.
> + *
> + * Additionally, there is the function drm_sched_increase_karma() which tries
> + * to find the entity which submitted a job and increases its 'karma' atomic
> + * variable to prevent resubmitting jobs from this entity. This has quite some
> + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> + * function is discouraged.
> + *
> + * Drivers can still recreate the GPU state in case it should be lost during
> + * timeout handling *if* they can guarantee that forward progress will be made
> + * and this doesn't cause another timeout. But this is strongly hardware
> + * specific and out of the scope of the general GPU scheduler.
> + */
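A short sketch of the error propagation described above, from a hypothetical fault handler; my_fence is an assumed driver type, while dma_fence_set_error() and dma_fence_signal() are the real dma_fence API:

```c
static void my_handle_fault(struct my_fence *f)
{
	/* The error must be set *before* signaling; the scheduler then
	 * bubbles it up to the scheduler fence, where drivers can query
	 * it via drm_sched_entity_error(). */
	dma_fence_set_error(&f->base, -EIO);
	dma_fence_signal(&f->base);
}
```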
> #include <linux/export.h>
> #include <linux/wait.h>
> #include <linux/sched.h>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 323a505e6e6a..0f0687b7ae9c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
> /**
> - * @timedout_job: Called when a job has taken too long to execute,
> - * to trigger GPU recovery.
> + * @timedout_job: Called when a hardware fence didn't signal within a
> + * configurable amount of time. Triggers GPU recovery.
> *
> * @sched_job: The job that has timed out
> *
> @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
> * that timeout handlers are executed sequentially.
> *
> * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> - *
> */
> enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>
* Re: [PATCH] drm/sched: Extend and update documentation
2025-07-24 15:07 ` Philipp Stanner
@ 2025-08-05 9:05 ` Christian König
2025-08-05 10:22 ` Philipp Stanner
0 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2025-08-05 9:05 UTC (permalink / raw)
To: Philipp Stanner, Philipp Stanner, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Jonathan Corbet, Matthew Brost, Danilo Krummrich,
Christian König, Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media
On 24.07.25 17:07, Philipp Stanner wrote:
>> +/**
>> + * DOC: Scheduler Fence Object
>> + *
>> + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
>> + * time from pushing the job into the scheduler until the hardware has finished
>> + * processing it. It is managed by the scheduler. The implementation provides
>> + * dma_fence interfaces for signaling both scheduling of a command submission
>> + * as well as finishing of processing.
>> + *
>> + * The lifetime of this object also follows normal dma_fence refcounting rules.
>> + */
>
> The relict I'm most unsure about is this docu for the scheduler fence.
> I know that some drivers are accessing the s_fence, but I strongly
> suspect that this is a) unnecessary and b) dangerous.
Which s_fence member do you mean? The one in the job? That should be harmless as far as I can see.
> But the original draft from Christian hinted at that. So, @Christian,
> this would be an opportunity to discuss this matter.
>
> Otherwise I'd drop this docu section in v2. What users don't know, they
> cannot misuse.
I would rather like to keep that to avoid misusing the job as the object for tracking the submission lifetime.
>> +/**
>> + * DOC: Error and Timeout handling
>> + *
>> + * Errors are signaled by using dma_fence_set_error() on the hardware fence
>> + * object before signaling it with dma_fence_signal(). Errors are then bubbled
>> + * up from the hardware fence to the scheduler fence.
>> + *
>> + * The entity allows querying errors on the last run submission using the
>> + * drm_sched_entity_error() function which can be used to cancel queued
>> + * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
>> + * pushing further ones into the entity in the driver's submission function.
>> + *
>> + * When the hardware fence doesn't signal within a configurable amount of time
>> + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
>> + * then follow the procedure described in that callback's documentation.
>> + *
>> + * (TODO: The timeout handler should probably switch to using the hardware
>> + * fence as parameter instead of the job. Otherwise the handling will always
>> + * race between timing out and signaling the fence).
>
> This TODO can probably removed, too. The recently merged
> DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.
No, it only scratched on the surface of problems here.
I'm seriously considering sending an RFC patch to clean up the job lifetime and implement this change.
Not necessarily giving the HW fence as parameter to the timeout callback, but more generally not letting the scheduler depend on driver behavior.
Regards,
Christian.
>
>
> P.
>
>> + *
>> + * The scheduler also used to provided functionality for re-submitting jobs
>> + * and, thereby, replaced the hardware fence during reset handling. This
>> + * functionality is now deprecated. This has proven to be fundamentally racy
>> + * and not compatible with dma_fence rules and shouldn't be used in new code.
>> + *
>> + * Additionally, there is the function drm_sched_increase_karma() which tries
>> + * to find the entity which submitted a job and increases its 'karma' atomic
>> + * variable to prevent resubmitting jobs from this entity. This has quite some
>> + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
>> + * function is discouraged.
>> + *
>> + * Drivers can still recreate the GPU state in case it should be lost during
>> + * timeout handling *if* they can guarantee that forward progress will be made
>> + * and this doesn't cause another timeout. But this is strongly hardware
>> + * specific and out of the scope of the general GPU scheduler.
>> + */
>> #include <linux/export.h>
>> #include <linux/wait.h>
>> #include <linux/sched.h>
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 323a505e6e6a..0f0687b7ae9c 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
>> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>>
>> /**
>> - * @timedout_job: Called when a job has taken too long to execute,
>> - * to trigger GPU recovery.
>> + * @timedout_job: Called when a hardware fence didn't signal within a
>> + * configurable amount of time. Triggers GPU recovery.
>> *
>> * @sched_job: The job that has timed out
>> *
>> @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
>> * that timeout handlers are executed sequentially.
>> *
>> * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
>> - *
>> */
>> enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>>
>
* Re: [PATCH] drm/sched: Extend and update documentation
2025-08-05 9:05 ` Christian König
@ 2025-08-05 10:22 ` Philipp Stanner
2025-08-07 14:15 ` Christian König
0 siblings, 1 reply; 12+ messages in thread
From: Philipp Stanner @ 2025-08-05 10:22 UTC (permalink / raw)
To: Christian König, Philipp Stanner, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Jonathan Corbet, Matthew Brost, Danilo Krummrich,
Christian König, Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media
On Tue, 2025-08-05 at 11:05 +0200, Christian König wrote:
> On 24.07.25 17:07, Philipp Stanner wrote:
> > > +/**
> > > + * DOC: Scheduler Fence Object
> > > + *
> > > + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> > > + * time from pushing the job into the scheduler until the hardware has finished
> > > + * processing it. It is managed by the scheduler. The implementation provides
> > > + * dma_fence interfaces for signaling both scheduling of a command submission
> > > + * as well as finishing of processing.
> > > + *
> > > + * The lifetime of this object also follows normal dma_fence refcounting rules.
> > > + */
> >
> > The relict I'm most unsure about is this docu for the scheduler fence.
> > I know that some drivers are accessing the s_fence, but I strongly
> > suspect that this is a) unnecessary and b) dangerous.
>
> Which s_fence member do you mean? The one in the job? That should be harmless as far as I can see.
I'm talking about struct drm_sched_fence.
>
> > But the original draft from Christian hinted at that. So, @Christian,
> > this would be an opportunity to discuss this matter.
> >
> > Otherwise I'd drop this docu section in v2. What users don't know, they
> > cannot misuse.
>
> I would rather like to keep that to avoid misusing the job as the object for tracking the submission lifetime.
Why would a driver ever want to access struct drm_sched_fence? The
driver knows when it signaled the hardware fence, and it knows when its
callbacks run_job() and free_job() were invoked.
I'm open to learn what amdgpu does there and why.
>
> > > +/**
> > > + * DOC: Error and Timeout handling
> > > + *
> > > + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> > > + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> > > + * up from the hardware fence to the scheduler fence.
> > > + *
> > > + * The entity allows querying errors on the last run submission using the
> > > + * drm_sched_entity_error() function which can be used to cancel queued
> > > + * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
> > > + * pushing further ones into the entity in the driver's submission function.
> > > + *
> > > + * When the hardware fence doesn't signal within a configurable amount of time
> > > + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> > > + * then follow the procedure described in that callback's documentation.
> > > + *
> > > + * (TODO: The timeout handler should probably switch to using the hardware
> > > + * fence as parameter instead of the job. Otherwise the handling will always
> > > + * race between timing out and signaling the fence).
> >
> > This TODO can probably removed, too. The recently merged
> > DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.
>
> No, it only scratched on the surface of problems here.
>
> I'm seriously considering sending a RFC patch to cleanup the job lifetime and implementing this change.
>
> Not necessarily giving the HW fence as parameter to the timeout callback, but more generally not letting the scheduler depend on driver behavior.
That's rather vague. Regarding this TODO, "racing between timing out
and signaling the fence" can now be corrected by the driver. Are there
more issues? If so, we want to add a new FIXME for them.
That said, such an RFC would obviously be great. We can discuss the
paragraph above there, if you want.
Regards
P.
>
> Regards,
> Christian.
>
> >
> >
> > P.
> >
> > > + *
> > > + * The scheduler also used to provided functionality for re-submitting jobs
> > > + * and, thereby, replaced the hardware fence during reset handling. This
> > > + * functionality is now deprecated. This has proven to be fundamentally racy
> > > + * and not compatible with dma_fence rules and shouldn't be used in new code.
> > > + *
> > > + * Additionally, there is the function drm_sched_increase_karma() which tries
> > > + * to find the entity which submitted a job and increases its 'karma' atomic
> > > + * variable to prevent resubmitting jobs from this entity. This has quite some
> > > + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> > > + * function is discouraged.
> > > + *
> > > + * Drivers can still recreate the GPU state in case it should be lost during
> > > + * timeout handling *if* they can guarantee that forward progress will be made
> > > + * and this doesn't cause another timeout. But this is strongly hardware
> > > + * specific and out of the scope of the general GPU scheduler.
> > > + */
> > > #include <linux/export.h>
> > > #include <linux/wait.h>
> > > #include <linux/sched.h>
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 323a505e6e6a..0f0687b7ae9c 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
> > > struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
> > >
> > > /**
> > > - * @timedout_job: Called when a job has taken too long to execute,
> > > - * to trigger GPU recovery.
> > > + * @timedout_job: Called when a hardware fence didn't signal within a
> > > + * configurable amount of time. Triggers GPU recovery.
> > > *
> > > * @sched_job: The job that has timed out
> > > *
> > > @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
> > > * that timeout handlers are executed sequentially.
> > > *
> > > * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> > > - *
> > > */
> > > enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
> > >
> >
>
* Re: [PATCH] drm/sched: Extend and update documentation
2025-08-05 10:22 ` Philipp Stanner
@ 2025-08-07 14:15 ` Christian König
2025-08-11 9:50 ` Philipp Stanner
0 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2025-08-07 14:15 UTC (permalink / raw)
To: Philipp Stanner, Philipp Stanner, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Jonathan Corbet, Matthew Brost, Danilo Krummrich,
Christian König, Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media
On 05.08.25 12:22, Philipp Stanner wrote:
> On Tue, 2025-08-05 at 11:05 +0200, Christian König wrote:
>> On 24.07.25 17:07, Philipp Stanner wrote:
>>>> +/**
>>>> + * DOC: Scheduler Fence Object
>>>> + *
>>>> + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
>>>> + * time from pushing the job into the scheduler until the hardware has finished
>>>> + * processing it. It is managed by the scheduler. The implementation provides
>>>> + * dma_fence interfaces for signaling both scheduling of a command submission
>>>> + * as well as finishing of processing.
>>>> + *
>>>> + * The lifetime of this object also follows normal dma_fence refcounting rules.
>>>> + */
>>>
>>> The relict I'm most unsure about is this docu for the scheduler fence.
>>> I know that some drivers are accessing the s_fence, but I strongly
>>> suspect that this is a) unnecessary and b) dangerous.
>>
>> Which s_fence member do you mean? The one in the job? That should be harmless as far as I can see.
>
> I'm talking about struct drm_sched_fence.
Yeah that is necessary for the drivers to know about. We could potentially abstract it better but we can't really hide it completely.
>>
>>> But the original draft from Christian hinted at that. So, @Christian,
>>> this would be an opportunity to discuss this matter.
>>>
>>> Otherwise I'd drop this docu section in v2. What users don't know, they
>>> cannot misuse.
>>
>> I would rather like to keep that to avoid misusing the job as the object for tracking the submission lifetime.
>
> Why would a driver ever want to access struct drm_sched_fence? The
> driver knows when it signaled the hardware fence, and it knows when its
> callbacks run_job() and free_job() were invoked.
>
> I'm open to learn what amdgpu does there and why.
The simplest use case is performance optimization. You sometimes have submissions which ideally run with others at the same time.
So we have AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES which basically tries to cast a fence to a scheduler fence and then only waits for the dependency to be pushed to the HW instead of waiting for it to finish (see amdgpu_cs.c).
Another example are gang submissions (where I still have the TODO to actually fix the code to not crash in an OOM situation).
Here we have a gang leader and gang members which are guaranteed to run together on the HW at the same time.
This works by adding scheduled dependencies to the gang leader so that the scheduler pushes it to the HW only after all gang members have been pushed.
The first gang member pushed now triggers dependency handling which makes sure that no other gang can be pushed until the gang leader is pushed as well.
>>>> +/**
>>>> + * DOC: Error and Timeout handling
>>>> + *
>>>> + * Errors are signaled by using dma_fence_set_error() on the hardware fence
>>>> + * object before signaling it with dma_fence_signal(). Errors are then bubbled
>>>> + * up from the hardware fence to the scheduler fence.
>>>> + *
>>>> + * The entity allows querying errors on the last run submission using the
>>>> + * drm_sched_entity_error() function which can be used to cancel queued
>>>> + * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
>>>> + * pushing further ones into the entity in the driver's submission function.
>>>> + *
>>>> + * When the hardware fence doesn't signal within a configurable amount of time
>>>> + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
>>>> + * then follow the procedure described in that callback's documentation.
>>>> + *
>>>> + * (TODO: The timeout handler should probably switch to using the hardware
>>>> + * fence as parameter instead of the job. Otherwise the handling will always
>>>> + * race between timing out and signaling the fence).
>>>
>>> This TODO can probably removed, too. The recently merged
>>> DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.
>>
>> No, it only scratched on the surface of problems here.
>>
>> I'm seriously considering sending a RFC patch to cleanup the job lifetime and implementing this change.
>>
>> Not necessarily giving the HW fence as parameter to the timeout callback, but more generally not letting the scheduler depend on driver behavior.
>
> That's rather vague. Regarding this TODO, "racing between timing out
> and signaling the fence" can now be corrected by the driver. Are there
> more issues? If so, we want to add a new FIXME for them.
Yeah good point. We basically worked around all those issues now.
It's just that I still see that we are missing a general concept. E.g. we applied workaround on top of workaround until it didn't crash anymore, instead of asking: OK, this is the design; does it work? Is it valid? Etc.
> That said, such an RFC would obviously be great. We can discuss the
> paragraph above there, if you want.
I will try to hack something together. Not necessarily complete but it should show the direction.
Christian.
>
>
> Regards
> P.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] drm/sched: Extend and update documentation
2025-08-07 14:15 ` Christian König
@ 2025-08-11 9:50 ` Philipp Stanner
2025-08-11 14:17 ` Christian König
0 siblings, 1 reply; 12+ messages in thread
From: Philipp Stanner @ 2025-08-11 9:50 UTC (permalink / raw)
To: Christian König, Philipp Stanner, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
Jonathan Corbet, Matthew Brost, Danilo Krummrich,
Christian König, Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media
On Thu, 2025-08-07 at 16:15 +0200, Christian König wrote:
> On 05.08.25 12:22, Philipp Stanner wrote:
> > On Tue, 2025-08-05 at 11:05 +0200, Christian König wrote:
> > > On 24.07.25 17:07, Philipp Stanner wrote:
> > > > > +/**
> > > > > + * DOC: Scheduler Fence Object
> > > > > + *
> > > > > + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> > > > > + * time from pushing the job into the scheduler until the hardware has finished
> > > > > + * processing it. It is managed by the scheduler. The implementation provides
> > > > > + * dma_fence interfaces for signaling both scheduling of a command submission
> > > > > + * as well as finishing of processing.
> > > > > + *
> > > > > + * The lifetime of this object also follows normal dma_fence refcounting rules.
> > > > > + */
> > > >
> > > > The relict I'm most unsure about is this docu for the scheduler fence.
> > > > I know that some drivers are accessing the s_fence, but I strongly
>>>> suspect that this is a) unnecessary and b) dangerous.
> > >
> > > Which s_fence member do you mean? The one in the job? That should be harmless as far as I can see.
> >
> > I'm talking about struct drm_sched_fence.
>
> Yeah that is necessary for the drivers to know about. We could potentially abstract it better but we can't really hide it completely.
>
> > >
> > > > But the original draft from Christian hinted at that. So, @Christian,
> > > > this would be an opportunity to discuss this matter.
> > > >
> > > > Otherwise I'd drop this docu section in v2. What users don't know, they
> > > > cannot misuse.
> > >
> > > I would rather like to keep that to avoid misusing the job as the object for tracking the submission lifetime.
> >
> > Why would a driver ever want to access struct drm_sched_fence? The
> > driver knows when it signaled the hardware fence, and it knows when its
> > callbacks run_job() and free_job() were invoked.
> >
> > I'm open to learn what amdgpu does there and why.
>
> The simplest use case is performance optimization. You sometimes have submissions which ideally run with others at the same time.
>
> So we have AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES which basically tries to cast a fence to a scheduler fence and then only waits for the dependency to be pushed to the HW instead of waiting for it to finish (see amdgpu_cs.c).
But the driver recognizes that a certain fence got / gets pushed right
now through backend_ops.run_job(), doesn't it?
>
> Another example is gang submissions (where I still have the TODO to actually fix the code so it doesn't crash in an OOM situation).
>
> Here we have a gang leader and gang members which are guaranteed to run together on the HW at the same time.
>
> This works by adding scheduled dependencies to the gang leader so that the scheduler pushes it to the HW only after all gang members have been pushed.
>
> The first gang member pushed now triggers a dependency handling which makes sure that no other gang can be pushed until gang leader is pushed as well.
You mean amdgpu registers callbacks to drm_sched_fence?
>
> > > > > +/**
> > > > > + * DOC: Error and Timeout handling
> > > > > + *
> > > > > + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> > > > > + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> > > > > + * up from the hardware fence to the scheduler fence.
> > > > > + *
> > > > > + * The entity allows querying errors on the last run submission using the
> > > > > + * drm_sched_entity_error() function which can be used to cancel queued
> > > > > + * submissions in &struct drm_sched_backend_ops.run_job as well as preventing
> > > > > + * pushing further ones into the entity in the driver's submission function.
> > > > > + *
> > > > > + * When the hardware fence doesn't signal within a configurable amount of time
> > > > > + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> > > > > + * then follow the procedure described in that callback's documentation.
> > > > > + *
> > > > > + * (TODO: The timeout handler should probably switch to using the hardware
> > > > > + * fence as parameter instead of the job. Otherwise the handling will always
> > > > > + * race between timing out and signaling the fence).
> > > >
>>>> This TODO can probably be removed, too. The recently merged
> > > > DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.
> > >
>>> No, it only scratched the surface of the problems here.
> > >
> > > I'm seriously considering sending a RFC patch to cleanup the job lifetime and implementing this change.
> > >
> > > Not necessarily giving the HW fence as parameter to the timeout callback, but more generally not letting the scheduler depend on driver behavior.
> >
> > That's rather vague. Regarding this TODO, "racing between timing out
> > and signaling the fence" can now be corrected by the driver. Are there
> > more issues? If so, we want to add a new FIXME for them.
>
> Yeah good point. We basically worked around all those issues now.
>
> It's just that I still see that we are missing a general concept. E.g. we applied workaround on top of workaround until it didn't crash anymore, instead of asking: OK, this is the design; does it work? Is it valid? Etc.
Yes, that seems to have been our destiny for a while now :) :(
What I'm afraid of right now is that with the callbacks vs.
drm_sched_fence we now potentially have several distinct mechanisms for
doing things. The hardware fence is clearly the relevant
synchronization object for telling when a job is completed; yet, we
also have s_fence->finished.
Using it (for what?) is even encouraged by the docu:
/**
* @finished: this fence is what will be signaled by the scheduler
* when the job is completed.
*
* When setting up an out fence for the job, you should use
* this, since it's available immediately upon
* drm_sched_job_init(), and the fence returned by the driver
* from run_job() won't be created until the dependencies have
* resolved.
*/
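What that comment suggests in practice is something like the following submission sketch (hedged; driver_submit_job() is a made-up name, but drm_sched_job_arm(), drm_sched_entity_push_job() and job->s_fence->finished are the real interfaces):

```c
/*
 * Hedged sketch of a driver submission path: the out-fence handed to
 * userspace is s_fence->finished, which already exists at this point,
 * while the HW fence returned from run_job() does not exist yet.
 */
static struct dma_fence *driver_submit_job(struct drm_sched_job *job)
{
	struct dma_fence *out_fence;

	drm_sched_job_arm(job);

	/*
	 * Grab a reference before pushing; the job may complete at any
	 * time after drm_sched_entity_push_job().
	 */
	out_fence = dma_fence_get(&job->s_fence->finished);

	drm_sched_entity_push_job(job);

	return out_fence;
}
```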
Anyways.
I think this is a big topic very suitable for our work shop at XDC. I
also have some ideas about paths forward that I want to present.
P.
>
> > That said, such an RFC would obviously be great. We can discuss the
> > paragraph above there, if you want.
>
> I will try to hack something together. Not necessarily complete but it should show the direction.
>
> Christian.
>
> >
> >
> > Regards
> > P.
* Re: [PATCH] drm/sched: Extend and update documentation
2025-08-11 9:50 ` Philipp Stanner
@ 2025-08-11 14:17 ` Christian König
0 siblings, 0 replies; 12+ messages in thread
From: Christian König @ 2025-08-11 14:17 UTC (permalink / raw)
To: phasta, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
David Airlie, Simona Vetter, Jonathan Corbet, Matthew Brost,
Danilo Krummrich, Christian König, Sumit Semwal
Cc: dri-devel, linux-doc, linux-kernel, linux-media
On 11.08.25 11:50, Philipp Stanner wrote:
>>>>
>>>>> But the original draft from Christian hinted at that. So, @Christian,
>>>>> this would be an opportunity to discuss this matter.
>>>>>
>>>>> Otherwise I'd drop this docu section in v2. What users don't know, they
>>>>> cannot misuse.
>>>>
>>>> I would rather like to keep that to avoid misusing the job as the object for tracking the submission lifetime.
>>>
>>> Why would a driver ever want to access struct drm_sched_fence? The
>>> driver knows when it signaled the hardware fence, and it knows when its
>>> callbacks run_job() and free_job() were invoked.
>>>
>>> I'm open to learn what amdgpu does there and why.
>>
>> The simplest use case is performance optimization. You sometimes have submissions which ideally run with others at the same time.
>>
>> So we have AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES which basically tries to cast a fence to a scheduler fence and then only waits for the dependency to be pushed to the HW instead of waiting for it to finish (see amdgpu_cs.c).
>
> But the driver recognizes that a certain fence got / gets pushed right
> now through backend_ops.run_job(), doesn't it?
Yeah, but how does that help?
>>
>> Another example is gang submissions (where I still have the TODO to actually fix the code so it doesn't crash in an OOM situation).
>>
>> Here we have a gang leader and gang members which are guaranteed to run together on the HW at the same time.
>>
>> This works by adding scheduled dependencies to the gang leader so that the scheduler pushes it to the HW only after all gang members have been pushed.
>>
>> The first gang member pushed now triggers a dependency handling which makes sure that no other gang can be pushed until gang leader is pushed as well.
>
> You mean amdgpu registers callbacks to drm_sched_fence?
No, we give it as dependency to drm_sched_job_add_dependency().
>>> That's rather vague. Regarding this TODO, "racing between timing out
>>> and signaling the fence" can now be corrected by the driver. Are there
>>> more issues? If so, we want to add a new FIXME for them.
>>
>> Yeah good point. We basically worked around all those issues now.
>>
>> It's just that I still see that we are missing a general concept. E.g. we applied workaround on top of workaround until it didn't crash anymore, instead of asking: OK, this is the design; does it work? Is it valid? Etc.
>
> Yes, that seems to have been our destiny for a while now :) :(
>
> What I'm afraid of right now is that with the callbacks vs.
> drm_sched_fence we now potentially have several distinct mechanisms for
> doing things. The hardware fence is clearly the relevant
> synchronization object for telling when a job is completed; yet, we
> also have s_fence->finished.
Not quite: s_fence->finished is what is relevant to the outside. The HW fence is only relevant to the inside of the scheduler.
> Using it (for what?) is even encouraged by the docu:
>
> /**
> * @finished: this fence is what will be signaled by the scheduler
> * when the job is completed.
> *
> * When setting up an out fence for the job, you should use
> * this, since it's available immediately upon
> * drm_sched_job_init(), and the fence returned by the driver
> * from run_job() won't be created until the dependencies have
> * resolved.
> */
That comment sounds correct to me. Drivers should mostly use s_fence->finished and not the HW fence to determine if something is done.
> Anyways.
> I think this is a big topic very suitable for our work shop at XDC. I
> also have some ideas about paths forward that I want to present.
Sounds good.
Half a year ago I was more or less ready to suggest ripping out the scheduler and starting from scratch, but now it looks more and more like there is light at the end of the tunnel.
Christian.
>
>
> P.
>
>>
>>> That said, such an RFC would obviously be great. We can discuss the
>>> paragraph above there, if you want.
>>
>> I will try to hack something together. Not necessarily complete but it should show the direction.
>>
>> Christian.
>>
>>>
>>>
>>> Regards
>>> P.
>
end of thread, other threads:[~2025-08-11 14:17 UTC | newest]
Thread overview: 12+ messages
2025-07-24 14:01 [PATCH] drm/sched: Extend and update documentation Philipp Stanner
2025-07-24 15:07 ` Philipp Stanner
2025-08-05 9:05 ` Christian König
2025-08-05 10:22 ` Philipp Stanner
2025-08-07 14:15 ` Christian König
2025-08-11 9:50 ` Philipp Stanner
2025-08-11 14:17 ` Christian König
2025-08-04 15:41 ` Christian König
-- strict thread matches above, loose matches on Subject: below --
2024-11-15 10:35 Philipp Stanner
2024-11-21 16:05 ` Alice Ryhl
2024-12-05 13:20 ` Philipp Stanner
2024-11-26 16:31 ` Tvrtko Ursulin