dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Philipp Stanner <phasta@mailbox.org>
To: "Maíra Canal" <mcanal@igalia.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"Philipp Stanner" <phasta@kernel.org>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"David Airlie" <airlied@gmail.com>,
	"Melissa Wen" <mwen@igalia.com>,
	"Lucas Stach" <l.stach@pengutronix.de>,
	"Russell King" <linux+etnaviv@armlinux.org.uk>,
	"Christian Gmeiner" <christian.gmeiner@gmail.com>,
	"Lucas De Marchi" <lucas.demarchi@intel.com>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Boris Brezillon" <boris.brezillon@collabora.com>,
	"Rob Herring" <robh@kernel.org>,
	"Steven Price" <steven.price@arm.com>,
	"Liviu Dudau" <liviu.dudau@arm.com>
Cc: kernel-dev@igalia.com, dri-devel@lists.freedesktop.org,
	etnaviv@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	"Min Ma" <min.ma@amd.com>, "Lizhi Hou" <lizhi.hou@amd.com>,
	"Oded Gabbay" <ogabbay@kernel.org>,
	"Frank Binns" <frank.binns@imgtec.com>,
	"Matt Coster" <matt.coster@imgtec.com>,
	"Qiang Yu" <yuq825@gmail.com>, "Lyude Paul" <lyude@redhat.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG
Date: Wed, 09 Jul 2025 15:14:33 +0200	[thread overview]
Message-ID: <71d67f799ccadb2858747cac516e04cff53e9234.camel@mailbox.org> (raw)
In-Reply-To: <20250708-sched-skip-reset-v5-0-2612b601f01a@igalia.com>

On Tue, 2025-07-08 at 10:25 -0300, Maíra Canal wrote:
> TL;DR: The only two patches that are lacking R-b's are:
> 
> [PATCH 2/8] drm/sched: Allow drivers to skip the reset and keep on running
> [PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>   -> If Intel CI succeeds, it's Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> 
> For those two patches, it would be great to gather feedback and/or R-b's,
> particularly from the Intel folks.
> 
> Thanks for all the reviews so far!
> 
> ---
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job may still be running, and there may be no valid reason to
> reset the hardware. This can occur in two situations:
> 
>   1. The GPU exposes some mechanism that ensures the GPU is still making
>      progress. By checking this mechanism, the driver can safely skip the
>      reset, re-arm the timeout, and allow the job to continue running until
>      completion. This is the case for v3d, Etnaviv, and Xe.
> 
>   2. Timeout has fired before the free-job worker. Consequently, the
>      scheduler calls `timedout_job()` for a job that isn't timed out.
> 
> These two scenarios are problematic because the job was removed from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> means that when the job finishes, it won't be freed by the scheduler
> though `sched->ops->free_job()`. As a result, the job and its resources
> won't be freed, leading to a memory leak.
> 
> For v3d specifically, we have observed that these memory leaks can be
> significant in certain scenarios, as reported by users in [1][2]. To
> address this situation, I submitted a patch similar to commit 704d3d60fec4
> ("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
> This patch has already landed in drm-misc-fixes and successfully resolved
> the users' issues.
> 
> However, as I mentioned in [3], exposing the scheduler's internals within
> the drivers isn't ideal and I believe this specific situation can be
> addressed within the DRM scheduler framework.
> 
> This series aims to resolve this issue by adding a new DRM sched status
> that allows a driver to skip the reset. This new status will indicate that
> the job should be reinserted into the pending list, and the driver will
> still signal its completion.
> 
> [1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
> [2] https://github.com/raspberrypi/linux/issues/6817
> [3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mcanal@igalia.com/T/
> 
> Best Regards,
> - Maíra
> 
> ---
> v1 -> v2:
> 
> - Fix several grammar nits across the documentation and commit messages.
> - Drop "drm/sched: Always free the job after the timeout" (Tvrtko Ursulin)
> - [1/8] NEW PATCH: Rename DRM_GPU_SCHED_STAT_NOMINAL to a more semantic
> 	name (Tvrtko Ursulin, Philipp Stanner)
> - [2/8] Rename DRM_GPU_SCHED_STAT_RUNNING to DRM_GPU_SCHED_STAT_NO_HANG (Tvrtko Ursulin, Philipp Stanner)
> - [2/8] Requeue free-job work after reinserting the job to the pending list (Matthew Brost)
> - [2/8] Create a helper function to reinsert the job (Philipp Stanner)
> - [2/8] Rewrite the commit message (Philipp Stanner)
> - [2/8] Add a comment to `drm_sched_start()` documentation, similar to what
> 	was commented in `drm_sched_stop()` (Philipp Stanner)
> - [3/8] Keep HZ as timeout for `drm_mock_sched_job_wait_scheduled()` (Tvrtko Ursulin)
> - [4/8] Use a job flag to indicate that `timedout_job()` should skip the
> 	reset (Tvrtko Ursulin)
> - [7/8] Use DRM_GPU_SCHED_STAT_NO_HANG to re-arm the timer in other cases
> 	as well (Matthew Brost)
> - Link to v1: https://lore.kernel.org/r/20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com
> 
> v2 -> v3:
> 
> - [2/8] Address comments about the commit message (Philipp Stanner)
> - [2/8] Improve comments and documentation style (Philipp Stanner)
> - [3/8] Rename the commit title to "drm/sched: Make timeout KUnit tests faster" (Philipp Stanner)
> - [3/8] Add Tvrtko's R-b (Tvrtko Ursulin)
> - [4/8] Instead of setting up a job duration, advance it manually (Tvrtko Ursulin)
> - [4/8] Wait for 2 * MOCK_TIMEOUT instead of MOCK_TIMEOUT (Tvrtko Ursulin)
> - [5/8, 6/8, 7/8, 8/8] Use Philipp's suggestion to improve the commit messages (Philipp Stanner)
> - Link to v2: https://lore.kernel.org/r/20250530-sched-skip-reset-v2-0-c40a8d2d8daa@igalia.com
> 
> v3 -> v4:
> 
> - [1/8] s/betters/better and Philipp's R-b (Philipp Stanner)
> - [2/8] Apply some documentation nits (Philipp Stanner)
> - [3/8] Add Philipp's A-b  (Philipp Stanner)
> - [4/8, 5/8] Add Tvrtko's R-b (Tvrtko Ursulin)
> - [6/8] Add Lucas' R-b (Lucas Stach)
> - Link to v3: https://lore.kernel.org/r/20250618-sched-skip-reset-v3-0-8be5cca2725d@igalia.com
> 
> v4 -> v5:
> 
> - Rebased on top of drm-tip (for Intel CI)
> - [2/8] Reword the commit message (Philipp Stanner)
> - [2/8] Reword several comments (Philipp Stanner)
> - [4/8] Add Philipp's R-b (Philipp Stanner)
> - Link to v4: https://lore.kernel.org/r/20250707-sched-skip-reset-v4-0-036c0f0f584f@igalia.com
> 
> ---
> Maíra Canal (8):
>       drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
>       drm/sched: Allow drivers to skip the reset and keep on running
>       drm/sched: Make timeout KUnit tests faster
>       drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG
>       drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>       drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>       drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
>       drm/panfrost: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
> 
>  drivers/accel/amdxdna/aie2_ctx.c                 |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c          |  2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c          | 16 +++----
>  drivers/gpu/drm/imagination/pvr_queue.c          |  4 +-
>  drivers/gpu/drm/lima/lima_sched.c                |  6 +--
>  drivers/gpu/drm/nouveau/nouveau_exec.c           |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c          |  2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c          | 10 ++---
>  drivers/gpu/drm/panthor/panthor_mmu.c            |  2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c          |  2 +-
>  drivers/gpu/drm/scheduler/sched_main.c           | 48 +++++++++++++++++++--
>  drivers/gpu/drm/scheduler/tests/mock_scheduler.c |  7 ++-
>  drivers/gpu/drm/scheduler/tests/sched_tests.h    |  1 +
>  drivers/gpu/drm/scheduler/tests/tests_basic.c    | 55 ++++++++++++++++++++++--
>  drivers/gpu/drm/v3d/v3d_sched.c                  | 18 ++------
>  drivers/gpu/drm/xe/xe_guc_submit.c               | 14 ++----
>  include/drm/gpu_scheduler.h                      |  7 ++-
>  17 files changed, 137 insertions(+), 61 deletions(-)

Does not apply to drm-misc-next:

Applying: drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
error: patch failed: drivers/gpu/drm/etnaviv/etnaviv_sched.c:87
error: drivers/gpu/drm/etnaviv/etnaviv_sched.c: patch does not apply
Patch failed at 0001 drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET

Awkward. That file has last been touched months ago. On what branch is
your series based?

Can you rebase?

From my POV you could also apply it yourself. Looks all good.

P.



> ---
> base-commit: 8b32b5509128873da8ecfc06beefcb58927eb50b
> change-id: 20250502-sched-skip-reset-bf7c163233da
> 


  parent reply	other threads:[~2025-07-09 13:14 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-08 13:25 [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-07-08 13:25 ` [PATCH v5 1/8] drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET Maíra Canal
2025-07-08 13:25 ` [PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running Maíra Canal
2025-07-09 13:08   ` Philipp Stanner
2025-07-11 13:22   ` Christian König
2025-07-11 13:37     ` Philipp Stanner
2025-07-11 15:20       ` Christian König
2025-07-11 17:23         ` Matthew Brost
2025-07-14  9:10           ` Christian König
2025-07-13 19:03         ` Maíra Canal
2025-07-14  9:23           ` Christian König
2025-07-14 10:16             ` Philipp Stanner
2025-07-14 11:46               ` Christian König
2025-07-11 14:35     ` Maíra Canal
2025-07-08 13:25 ` [PATCH v5 3/8] drm/sched: Make timeout KUnit tests faster Maíra Canal
2025-07-08 13:25 ` [PATCH v5 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal
2025-07-08 13:25 ` [PATCH v5 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset Maíra Canal
2025-07-08 13:25 ` [PATCH v5 6/8] drm/etnaviv: " Maíra Canal
2025-07-08 13:25 ` [PATCH v5 7/8] drm/xe: " Maíra Canal
2025-07-08 18:35   ` Matthew Brost
2025-07-08 13:25 ` [PATCH v5 8/8] drm/panfrost: " Maíra Canal
2025-07-09 13:14 ` Philipp Stanner [this message]
2025-07-10 11:27   ` [PATCH v5 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG Maíra Canal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=71d67f799ccadb2858747cac516e04cff53e9234.camel@mailbox.org \
    --to=phasta@mailbox.org \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=boris.brezillon@collabora.com \
    --cc=christian.gmeiner@gmail.com \
    --cc=christian.koenig@amd.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=dakr@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=etnaviv@lists.freedesktop.org \
    --cc=frank.binns@imgtec.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=kernel-dev@igalia.com \
    --cc=l.stach@pengutronix.de \
    --cc=linux+etnaviv@armlinux.org.uk \
    --cc=liviu.dudau@arm.com \
    --cc=lizhi.hou@amd.com \
    --cc=lucas.demarchi@intel.com \
    --cc=lyude@redhat.com \
    --cc=matt.coster@imgtec.com \
    --cc=matthew.brost@intel.com \
    --cc=mcanal@igalia.com \
    --cc=min.ma@amd.com \
    --cc=mwen@igalia.com \
    --cc=ogabbay@kernel.org \
    --cc=phasta@kernel.org \
    --cc=robh@kernel.org \
    --cc=rodrigo.vivi@intel.com \
    --cc=simona@ffwll.ch \
    --cc=steven.price@arm.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=tvrtko.ursulin@igalia.com \
    --cc=yuq825@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).