amd-gfx.lists.freedesktop.org archive mirror
* [PATCH V9 00/36] Reset improvements
@ 2025-06-17  3:07 Alex Deucher
  2025-06-17  3:07 ` [PATCH 01/36] drm/amdgpu: switch job hw_fence to amdgpu_fence Alex Deucher
                   ` (35 more replies)
  0 siblings, 36 replies; 60+ messages in thread
From: Alex Deucher @ 2025-06-17  3:07 UTC
  To: amd-gfx, christian.koenig, sasundar; +Cc: Alex Deucher

This set improves per-queue reset support for a number of IPs.
When we reset the queue, the queue is lost, so we need
to re-emit the unprocessed state from subsequent submissions.
To make sure we actually restore the unprocessed state,
we need to enable legacy enforce isolation so that we can
safely re-emit it.  If we don't, multiple jobs can run in
parallel and we may not end up resetting the correct one.
This is similar to how Windows handles queues.  This also
gives us correct guilty tracking for GC.

Tested on GC 10 and 11 chips with a game running and
then running hang tests.  The game pauses when the
hang happens, then continues after the queue reset.

I tried this same approach on GC8 and 9, but it
was not as reliable as soft recovery.  As such, I've dropped
the KGQ reset code for pre-GC10.

The same approach is extended to SDMA and VCN.
They don't need enforce isolation because those engines
are single threaded so they always operate serially.

Rework re-emit to signal the seq number of the bad job and
use that to verify that the reset worked, then re-emit the
rest of the non-guilty state.  This way we are not waiting on
the rest of the state to complete, and if the subsequent state
also contains a bad job, we'll end up in queue reset again rather
than adapter reset.
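
The ordering described above (signal the bad job's seq number, check it,
then re-emit the rest) can be pictured with a small stand-alone model.
This is only an illustrative sketch and not code from the series: the
toy_* types and functions are hypothetical names, the "hardware" is a
plain struct, and skipping every later job from the guilty context is an
assumption taken from the v6 changelog entry below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types: none of these names come from the amdgpu driver. */
struct toy_job {
	uint32_t seq;	/* fence sequence number assigned at submission */
	uint32_t ctx;	/* id of the submitting context */
};

struct toy_ring {
	uint32_t last_signaled_seq;	/* value the "hardware" wrote back */
	int hw_alive;			/* 0 models a reset that did not take */
};

/* Model of the fence write emitted right after the queue reset. */
static void toy_emit_fence(struct toy_ring *ring, uint32_t seq)
{
	if (ring->hw_alive)
		ring->last_signaled_seq = seq;
}

/*
 * Returns the number of jobs copied into out[] for re-emission, or -1 if
 * the reset could not be verified (a caller would then escalate).
 */
static int toy_reset_and_reemit(struct toy_ring *ring,
				const struct toy_job *pending, size_t n,
				size_t guilty, struct toy_job *out)
{
	size_t i;
	int count = 0;

	assert(guilty < n);

	/* Step 1: signal the bad job's seq number on the reset queue. */
	toy_emit_fence(ring, pending[guilty].seq);

	/* Step 2: verify the write landed before re-emitting anything. */
	if (ring->last_signaled_seq != pending[guilty].seq)
		return -1;

	/* Step 3: re-emit later work, skipping the guilty context. */
	for (i = guilty + 1; i < n; i++) {
		if (pending[i].ctx == pending[guilty].ctx)
			continue;
		out[count++] = pending[i];
	}
	return count;
}
```

In this model the verification step depends only on the bad job's fence,
so a second bad job among the re-emitted work would show up as a new
timeout on the same queue rather than failing this reset.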

Git tree:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

v4: Drop explicit padding patches
    Drop new timeout macro
    Rework re-emit sequence
v5: Add a helper for reemit
    Convert VCN, JPEG, SDMA to use new helpers
v6: Update SDMA 4.4.2 to use new helpers
    Move ptr tracking to amdgpu_fence
    Skip all jobs from the bad context on the ring
v7: Rework the backup logic
    Move and clean up the guilty logic for engine resets
    Integrate suggestions from Christian
    Add JPEG 4.0.5 support
v8: Add non-guilty ring backup handling
    Clean up new function signatures
    Reorder some bug fixes to the start of the series
v9: Clean up fence_emit
    SDMA 5.x fixes
    Add new reset helpers
    sched wqueue stop/start cleanup
    Add support for VCNs without unified queues

Alex Deucher (35):
  drm/amdgpu: switch job hw_fence to amdgpu_fence
  drm/amdgpu: remove job parameter from amdgpu_fence_emit()
  drm/amdgpu: remove fence slab
  drm/amdgpu: enable legacy enforce isolation by default
  drm/amdgpu/sdma5.x: suspend KFD queues in ring reset
  drm/amdgpu/sdma5: init engine reset mutex
  drm/amdgpu/sdma5.2: init engine reset mutex
  drm/amdgpu: update ring reset function signature
  drm/amdgpu: move force completion into ring resets
  drm/amdgpu: move guilty handling into ring resets
  drm/amdgpu: move scheduler wqueue handling into callbacks
  drm/amdgpu: track ring state associated with a job
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.5: add queue reset
  drm/amdgpu/jpeg5: add queue reset
  drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn: add a helper framework for engine resets
  drm/amdgpu/vcn2: implement ring reset
  drm/amdgpu/vcn2.5: implement ring reset
  drm/amdgpu/vcn3: implement ring reset

Christian König (1):
  drm/amdgpu: rework queue reset scheduler interaction

 drivers/gpu/drm/amd/amdgpu/amdgpu.h         |   3 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c     |   5 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 175 +++++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c      |  19 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  60 ++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c    |  59 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    |  43 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c    |  17 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c     |  64 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h     |   6 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c      |  42 ++---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c      |  33 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c      |  33 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c       |   9 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c     |  11 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c      |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c      |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c      |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c      |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c    |  11 ++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c    |  14 ++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c    |  49 +++---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c      |  17 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c      |  17 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c      |  25 ++-
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c      |  25 ++-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c       |  25 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c       |  24 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c       |  24 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c       |   8 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c     |   9 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c     |   8 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c     |   8 +-
 38 files changed, 629 insertions(+), 275 deletions(-)

-- 
2.49.0


* [PATCH V14 00/36] Reset improvements
@ 2025-07-07 19:03 Alex Deucher
  2025-07-07 19:03 ` [PATCH 33/36] drm/amdgpu/vcn: add a helper framework for engine resets Alex Deucher
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Deucher @ 2025-07-07 19:03 UTC
  To: amd-gfx, christian.koenig, sasundar; +Cc: Alex Deucher

This set improves per-queue reset support for a number of IPs.
When we reset the queue, the queue is lost, so we need
to re-emit the unprocessed state from subsequent submissions.
This is handled in gfx/compute queues via switch buffer and
pipeline sync packets.  However, you can still end up with
parallel execution across queues.  For correctness in that
case, enforce isolation needs to be enabled.  That can
impact certain use cases, however, and in most cases the
guilty job is correctly identified even without enforce isolation.

Tested on GC 10 and 11 chips with a game running and
then running hang tests.  The game pauses when the
hang happens, then continues after the queue reset.

The same approach is extended to SDMA and VCN.
They don't need enforce isolation because those engines
are single threaded so they always operate serially.

Rework re-emit to signal the seq number of the bad job and
use that to verify that the reset worked, then re-emit the
rest of the non-guilty state.  This way we are not waiting on
the rest of the state to complete, and if the subsequent state
also contains a bad job, we'll end up in queue reset again rather
than adapter reset.
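
Before the non-guilty state can be re-emitted, it has to be saved from
the ring.  The sketch below shows one way that could look for a circular
buffer.  It is a toy illustration under stated assumptions, not the
driver's implementation: toy_ring_backup and TOY_RING_DWORDS are made-up
names, the ring size is assumed to be a power of two, and the
unprocessed range is assumed to be given as two dword offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_DWORDS 8u	/* power of two, so masking wraps the index */

/*
 * Copy the dwords in [start, end) out of a circular ring into backup[],
 * following the wrap at the end of the buffer.  Returns the dword count.
 * start == end is treated as an empty range in this toy model.
 */
static size_t toy_ring_backup(const uint32_t *ring, uint32_t start,
			      uint32_t end, uint32_t *backup)
{
	const uint32_t mask = TOY_RING_DWORDS - 1;
	size_t n = 0;

	assert(ring && backup);

	for (start &= mask, end &= mask; start != end;
	     start = (start + 1) & mask)
		backup[n++] = ring[start];

	return n;
}
```

The saved dwords could then be written back to the ring after the reset
has been verified; how the real series delimits and filters the range is
not shown here.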

Tested with HangTestSuite and IGT reset/deadlock tests.

Patches apply to the amd-staging-drm-next or drm-next branches in my
git tree.

Git tree:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

The IGT deadlock tests need the following fixes to properly handle -ETIME fences:
https://patchwork.freedesktop.org/series/150724/

v4: Drop explicit padding patches
    Drop new timeout macro
    Rework re-emit sequence
v5: Add a helper for reemit
    Convert VCN, JPEG, SDMA to use new helpers
v6: Update SDMA 4.4.2 to use new helpers
    Move ptr tracking to amdgpu_fence
    Skip all jobs from the bad context on the ring
v7: Rework the backup logic
    Move and clean up the guilty logic for engine resets
    Integrate suggestions from Christian
    Add JPEG 4.0.5 support
v8: Add non-guilty ring backup handling
    Clean up new function signatures
    Reorder some bug fixes to the start of the series
v9: Clean up fence_emit
    SDMA 5.x fixes
    Add new reset helpers
    sched wqueue stop/start cleanup
    Add support for VCNs without unified queues
v10: Drop enforce isolation default change
     Add more documentation
     Clean up ring backup logic
v11: SDMA6/7 fixes
v12: Ring backup and reemit fixes
     SDMA cleanups
     SDMA5.x reemit support
     GFX10 KGQ reset fix
v13: Drop SDMA cleanups; they caused regressions in some IGT tests
v14: Split out reset fixes as separate patches
     Add additional error handling for VCN and JPEG
     Update commit messages per feedback

Alex Deucher (36):
  drm/amdgpu/gfx9: fix kiq locking in KCQ reset
  drm/amdgpu/gfx9.4.3: fix kiq locking in KCQ reset
  drm/amdgpu/gfx10: fix kiq locking in KCQ reset
  drm/amdgpu: clean up sdma reset functions
  drm/amdgpu/jpeg2: add additional ring reset error checking
  drm/amdgpu/jpeg3: add additional ring reset error checking
  drm/amdgpu/jpeg4: add additional ring reset error checking
  drm/amdgpu/vcn4: add additional ring reset error checking
  drm/amdgpu/vcn4.0.5: add additional ring reset error checking
  drm/amdgpu/vcn5: add additional ring reset error checking
  drm/amdgpu: track ring state associated with a fence
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.5: add queue reset
  drm/amdgpu/jpeg5: add queue reset
  drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn: add a helper framework for engine resets
  drm/amdgpu/vcn2: implement ring reset
  drm/amdgpu/vcn2.5: implement ring reset
  drm/amdgpu/vcn3: implement ring reset

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 90 +++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c    | 15 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c  | 67 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 18 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   | 76 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |  4 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    | 41 ++---------
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c    | 35 +--------
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c    | 35 +--------
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c     | 12 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c   | 12 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c    | 13 ++--
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c    | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c    | 13 ++--
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c    | 13 ++--
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c  | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c  | 17 +++++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c  | 20 +++++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c  | 11 +--
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c    | 17 ++++-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c    | 17 ++++-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c    | 20 ++---
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c    | 20 ++---
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c     | 15 ++++
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c     | 14 ++++
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c     | 16 ++++
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c     | 14 ++--
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c   | 10 +--
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c   | 14 ++--
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c   | 14 ++--
 32 files changed, 465 insertions(+), 230 deletions(-)

-- 
2.50.0



end of thread, other threads:[~2025-07-07 19:04 UTC | newest]

Thread overview: 60+ messages
2025-06-17  3:07 [PATCH V9 00/36] Reset improvements Alex Deucher
2025-06-17  3:07 ` [PATCH 01/36] drm/amdgpu: switch job hw_fence to amdgpu_fence Alex Deucher
2025-06-17  9:42   ` Christian König
2025-06-17  3:07 ` [PATCH 02/36] drm/amdgpu: remove job parameter from amdgpu_fence_emit() Alex Deucher
2025-06-17 11:44   ` Christian König
2025-06-17 13:46     ` Alex Deucher
2025-06-17 13:49       ` Alex Deucher
2025-06-18  7:15         ` Christian König
2025-06-18 22:32     ` Alex Deucher
2025-06-17  3:07 ` [PATCH 03/36] drm/amdgpu: remove fence slab Alex Deucher
2025-06-17 11:49   ` Christian König
2025-06-17  3:07 ` [PATCH 04/36] drm/amdgpu: enable legacy enforce isolation by default Alex Deucher
2025-06-17  3:07 ` [PATCH 05/36] drm/amdgpu/sdma5.x: suspend KFD queues in ring reset Alex Deucher
2025-06-17  3:07 ` [PATCH 06/36] drm/amdgpu/sdma5: init engine reset mutex Alex Deucher
2025-06-17  5:50   ` Zhang, Jesse(Jie)
2025-06-17  6:09   ` Zhang, Jesse(Jie)
2025-06-17 11:50   ` Christian König
2025-06-17  3:07 ` [PATCH 07/36] drm/amdgpu/sdma5.2: " Alex Deucher
2025-06-17  6:08   ` Zhang, Jesse(Jie)
2025-06-17  3:07 ` [PATCH 08/36] drm/amdgpu: update ring reset function signature Alex Deucher
2025-06-17 12:20   ` Christian König
2025-06-17  3:07 ` [PATCH 09/36] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
2025-06-17  3:07 ` [PATCH 10/36] drm/amdgpu: move force completion into ring resets Alex Deucher
2025-06-17  3:07 ` [PATCH 11/36] drm/amdgpu: move guilty handling " Alex Deucher
2025-06-17 12:28   ` Christian König
2025-06-17  3:07 ` [PATCH 12/36] drm/amdgpu: move scheduler wqueue handling into callbacks Alex Deucher
2025-06-17  3:07 ` [PATCH 13/36] drm/amdgpu: track ring state associated with a job Alex Deucher
2025-06-18 14:53   ` Christian König
2025-06-17  3:07 ` [PATCH 14/36] drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset Alex Deucher
2025-06-17  3:07 ` [PATCH 15/36] drm/amdgpu/gfx9.4.3: " Alex Deucher
2025-06-17  3:07 ` [PATCH 16/36] drm/amdgpu/gfx10: re-emit unprocessed state on ring reset Alex Deucher
2025-06-17  3:07 ` [PATCH 17/36] drm/amdgpu/gfx11: " Alex Deucher
2025-06-17  3:07 ` [PATCH 18/36] drm/amdgpu/gfx12: " Alex Deucher
2025-06-17  3:07 ` [PATCH 19/36] drm/amdgpu/sdma6: " Alex Deucher
2025-06-17  3:07 ` [PATCH 20/36] drm/amdgpu/sdma7: " Alex Deucher
2025-06-17  3:08 ` [PATCH 21/36] drm/amdgpu/jpeg2: " Alex Deucher
2025-06-17  3:08 ` [PATCH 22/36] drm/amdgpu/jpeg2.5: " Alex Deucher
2025-06-17  3:08 ` [PATCH 23/36] drm/amdgpu/jpeg3: " Alex Deucher
2025-06-17  3:08 ` [PATCH 24/36] drm/amdgpu/jpeg4: " Alex Deucher
2025-06-17  3:08 ` [PATCH 25/36] drm/amdgpu/jpeg4.0.3: " Alex Deucher
2025-06-17  3:08 ` [PATCH 26/36] drm/amdgpu/jpeg4.0.5: add queue reset Alex Deucher
2025-06-17  3:08 ` [PATCH 27/36] drm/amdgpu/jpeg5: " Alex Deucher
2025-06-17  3:08 ` [PATCH 28/36] drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset Alex Deucher
2025-06-17  3:08 ` [PATCH 29/36] drm/amdgpu/vcn4: " Alex Deucher
2025-06-17  3:08 ` [PATCH 30/36] drm/amdgpu/vcn4.0.3: " Alex Deucher
2025-06-17  3:08 ` [PATCH 31/36] drm/amdgpu/vcn4.0.5: " Alex Deucher
2025-06-17  3:08 ` [PATCH 32/36] drm/amdgpu/vcn5: " Alex Deucher
2025-06-17  3:08 ` [PATCH 33/36] drm/amdgpu/vcn: add a helper framework for engine resets Alex Deucher
2025-06-17  4:30   ` Sundararaju, Sathishkumar
2025-06-17  6:10     ` Sundararaju, Sathishkumar
2025-06-17 13:09       ` Alex Deucher
2025-06-17 16:49         ` Sundararaju, Sathishkumar
2025-06-17  3:08 ` [PATCH 34/36] drm/amdgpu/vcn2: implement ring reset Alex Deucher
2025-06-17  3:08 ` [PATCH 35/36] drm/amdgpu/vcn2.5: " Alex Deucher
2025-06-17  3:08 ` [PATCH 36/36] drm/amdgpu/vcn3: " Alex Deucher
2025-06-17 19:22   ` Sundararaju, Sathishkumar
2025-06-17 20:14     ` Alex Deucher
2025-06-18  7:35       ` Sundararaju, Sathishkumar
2025-06-18 14:16         ` Sundararaju, Sathishkumar
  -- strict thread matches above, loose matches on Subject: below --
2025-07-07 19:03 [PATCH V14 00/36] Reset improvements Alex Deucher
2025-07-07 19:03 ` [PATCH 33/36] drm/amdgpu/vcn: add a helper framework for engine resets Alex Deucher
