From: Philipp Stanner <phasta@mailbox.org>
To: "Maíra Canal" <mcanal@igalia.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"Philipp Stanner" <phasta@kernel.org>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
"Simona Vetter" <simona@ffwll.ch>,
"David Airlie" <airlied@gmail.com>,
"Melissa Wen" <mwen@igalia.com>,
"Lucas Stach" <l.stach@pengutronix.de>,
"Russell King" <linux+etnaviv@armlinux.org.uk>,
"Christian Gmeiner" <christian.gmeiner@gmail.com>,
"Lucas De Marchi" <lucas.demarchi@intel.com>,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"Boris Brezillon" <boris.brezillon@collabora.com>,
"Rob Herring" <robh@kernel.org>,
"Steven Price" <steven.price@arm.com>,
"Liviu Dudau" <liviu.dudau@arm.com>
Cc: kernel-dev@igalia.com, dri-devel@lists.freedesktop.org,
etnaviv@lists.freedesktop.org, intel-xe@lists.freedesktop.org
Subject: Re: [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
Date: Wed, 09 Jul 2025 15:08:25 +0200
Message-ID: <d1ef5cca7c63ecbfd7f1aad04952442eb54dd42e.camel@mailbox.org>
In-Reply-To: <20250708-sched-skip-reset-v5-2-2612b601f01a@igalia.com>
On Tue, 2025-07-08 at 10:25 -0300, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job just took unusually long (longer than the timeout) but is
> still running, and there is, thus, no reason to reset the hardware. This
> can occur in two scenarios:
>
> 1. The job is taking longer than the timeout, but the driver determined
>    through a GPU-specific mechanism that the hardware is still making
>    progress. Hence, the driver would like the scheduler to skip the
>    timeout and treat the job as still pending from then onward. This
>    happens in v3d, Etnaviv, and Xe.
> 2. Timeout has fired before the free-job worker. Consequently, the
>    scheduler calls `sched->ops->timedout_job()` for a job that isn't
>    timed out.
>
> These two scenarios are problematic because the job was removed from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> means that when the job finishes, it won't be freed by the scheduler
> through `sched->ops->free_job()` - leading to a memory leak.
>
> To solve these problems, create a new `drm_gpu_sched_stat`, called
> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
> new status will indicate that the job must be reinserted into
> `sched->pending_list`, and the hardware / driver will still complete that
> job.
>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>
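
For anyone reading along, the first scenario comes down to a decision the driver makes in its timedout_job callback. Below is a minimal userspace sketch of that decision, not taken from any of the converted drivers. The enum mirrors the one this patch extends; struct mock_gpu, its progress counter and mock_timedout_job() are hypothetical stand-ins, and real drivers detect progress through their own GPU-specific mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace copy of the enum as it looks after this patch. */
enum drm_gpu_sched_stat {
	DRM_GPU_SCHED_STAT_NONE,
	DRM_GPU_SCHED_STAT_RESET,
	DRM_GPU_SCHED_STAT_ENODEV,
	DRM_GPU_SCHED_STAT_NO_HANG,
};

/* Hypothetical device state, invented for this sketch only. */
struct mock_gpu {
	uint64_t progress_now;       /* hypothetical hardware progress counter */
	uint64_t progress_last_seen; /* value sampled at the previous timeout */
	bool unplugged;              /* device is gone */
	unsigned int resets;         /* number of resets performed */
};

/* Hypothetical stand-in for a driver's timedout_job callback. */
static enum drm_gpu_sched_stat mock_timedout_job(struct mock_gpu *gpu)
{
	assert(gpu);

	if (gpu->unplugged)
		return DRM_GPU_SCHED_STAT_ENODEV;

	if (gpu->progress_now != gpu->progress_last_seen) {
		/*
		 * The hardware moved since the last timeout: remember the new
		 * value and skip the reset. On this path a real driver must
		 * not call drm_sched_stop() or drm_sched_start().
		 */
		gpu->progress_last_seen = gpu->progress_now;
		return DRM_GPU_SCHED_STAT_NO_HANG;
	}

	/* No progress since the last timeout: treat it as a real hang. */
	gpu->resets++;
	return DRM_GPU_SCHED_STAT_RESET;
}
```

In this model, two timeouts in a row without progress lead to a reset, while a single slow but advancing job is left alone.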
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
> include/drm/gpu_scheduler.h            |  3 +++
> 2 files changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> {
> struct drm_sched_job *job;
>
> - spin_lock(&sched->job_list_lock);
> job = list_first_entry_or_null(&sched->pending_list,
> struct drm_sched_job, list);
> if (job && dma_fence_is_signaled(&job->s_fence->finished))
> __drm_sched_run_free_queue(sched);
> +}
> +
> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
> +{
> + spin_lock(&sched->job_list_lock);
> + drm_sched_run_free_queue(sched);
> spin_unlock(&sched->job_list_lock);
> }
>
> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> spin_unlock(&sched->job_list_lock);
> }
>
> +/**
> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
> + * @sched: scheduler instance
> + * @job: job to be reinserted on the pending list
> + *
> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
> + * hung and is making progress, the scheduler must reinsert the job back into
> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
> + * through the &struct drm_sched_backend_ops.free_job callback.
> + *
> + * This function must be used in "false timeout" cases only.
> + */
> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
> + struct drm_sched_job *job)
> +{
> + spin_lock(&sched->job_list_lock);
> + list_add(&job->list, &sched->pending_list);
> +
> + /* After reinserting the job, the scheduler enqueues the free-job work
> + * again if ready. Otherwise, a signaled job could be added to the
> + * pending list, but never freed.
> + */
> + drm_sched_run_free_queue(sched);
> + spin_unlock(&sched->job_list_lock);
> +}
> +
> +
> static void drm_sched_job_timedout(struct work_struct *work)
> {
> struct drm_gpu_scheduler *sched;
> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
> job->sched->ops->free_job(job);
> sched->free_guilty = false;
> }
> +
> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
> + drm_sched_job_reinsert_on_false_timeout(sched, job);
> } else {
> spin_unlock(&sched->job_list_lock);
> }
> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> * This function is typically used for reset recovery (see the docu of
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's only used for reset recovery, drivers must not call this function
> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> */
> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> {
> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
> * scheduler startup. The scheduler itself is fully operational after
> * drm_sched_init() succeeded.
> + *
> + * As it's only used for reset recovery, drivers must not call this function
> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
> */
> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> {
> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
> if (job)
> sched->ops->free_job(job);
>
> - drm_sched_run_free_queue(sched);
> + drm_sched_run_free_queue_unlocked(sched);
> drm_sched_run_job_queue(sched);
> }
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -393,11 +393,14 @@ struct drm_sched_job {
> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
> + * did not hang and is still running.
> */
> enum drm_gpu_sched_stat {
> DRM_GPU_SCHED_STAT_NONE,
> DRM_GPU_SCHED_STAT_RESET,
> DRM_GPU_SCHED_STAT_ENODEV,
> + DRM_GPU_SCHED_STAT_NO_HANG,
> };
>
> /**
>
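
To make the reinsertion path easier to follow, here is a small userspace model of what the timeout handler does with the pending list on a false timeout. It is only a sketch under simplifying assumptions: struct mock_job, struct mock_sched and the mock_*() helpers are hypothetical, a plain singly linked list replaces the kernel's list_head, a bool replaces the finished fence, and the job_list_lock is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job: 'finished' stands in for the signaled finished fence. */
struct mock_job {
	struct mock_job *next;
	bool finished;
};

/* Hypothetical scheduler: head of the pending list plus a work-queued flag. */
struct mock_sched {
	struct mock_job *pending;
	bool free_work_queued;
};

/* Models drm_sched_run_free_queue(): queue the free-job work if the
 * first pending job has already finished. */
static void mock_run_free_queue(struct mock_sched *s)
{
	if (s->pending && s->pending->finished)
		s->free_work_queued = true;
}

/* Models the timeout handler taking the first job off the pending list
 * before it calls the driver's timedout_job callback. */
static struct mock_job *mock_take_for_timeout(struct mock_sched *s)
{
	struct mock_job *job = s->pending;

	if (job) {
		s->pending = job->next;
		job->next = NULL;
	}
	return job;
}

/* Models drm_sched_job_reinsert_on_false_timeout(): put the job back at
 * the head of the list, then re-check whether it can be freed. */
static void mock_reinsert_on_false_timeout(struct mock_sched *s,
					   struct mock_job *job)
{
	assert(s && job);
	job->next = s->pending;
	s->pending = job;
	mock_run_free_queue(s);
}
```

The re-check inside the reinsert helper covers a job that finished while it was off the list: without it, the job would sit on the pending list with nothing left to queue the free-job work.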