From: Christopher Snowhill <chris@kode54.net>
To: Alex Deucher <alexdeucher@gmail.com>
Cc: <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 00/34] GC per queue reset
Date: Tue, 23 Jul 2024 01:50:18 -0700
Message-ID: <87a5i8jxv9.fsf@gmail.com>
In-Reply-To: <CADnq5_NHzGzPe73Ks8au=_up87PTJU11mHpCxVcQBNcWkW-b8w@mail.gmail.com>

Alex Deucher <alexdeucher@gmail.com> writes:

> On Thu, Jul 18, 2024 at 10:15 AM Alex Deucher <alexander.deucher@amd.com> wrote:
>>
>> This adds preliminary support for GC per queue reset.  In this
>> case, only the jobs currently in the queue are lost.  If this
>> fails, we fall back to a full adapter reset.
>
> Also available here via git:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Just tested this, after encountering the double-add crash trying to
reset after a GPU hang. It doesn't seem to gracefully recover from this
particular GPU hang, but at least now it resets properly. Still not
going to attempt to run it against KDE / Plasma 6.1.3 on Arch, as that
loves to hang if there's any Xwayland involved in the GPU reset event.

However, under labwc-git with my own PR applied to it, it recovers okay,
though Xwayland eventually crashes and is restarted by labwc. Here's a
dmesg log excerpt of the reset and recovery event:

[  189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[  189.830642] amdgpu 0000:0a:00.0: amdgpu: Process information: process Stray-Win64-Shi pid 11560 thread vkd3d_queue pid 11719
[  190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[  190.457702] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
[  190.459418] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
[  190.459420] amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
[  190.459423] amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
[  190.459483] amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
[  190.967464] amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[  190.967824] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[  190.967912] [drm] VRAM is lost due to GPU reset!
[  190.967914] amdgpu 0000:0a:00.0: amdgpu: PSP is resuming...
[  191.042264] amdgpu 0000:0a:00.0: amdgpu: reserve 0xa00000 from 0x82fd000000 for PSP TMR
[  191.143003] amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  191.156566] amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  191.156572] amdgpu 0000:0a:00.0: amdgpu: SMU is resuming...
[  191.156576] amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413e00 (65.62.0)
[  191.156580] amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
[  191.156609] amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
[  191.215750] amdgpu 0000:0a:00.0: amdgpu: SMU is resumed successfully!
[  191.217023] [drm] DMUB hardware initialized: version=0x02020020
[  191.530005] [drm] kiq ring mec 2 pipe 1 q 0
[  191.532863] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  191.532866] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[  191.532867] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[  191.532869] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[  191.532870] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[  191.532871] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[  191.532872] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[  191.532874] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[  191.532875] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[  191.532876] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[  191.532878] amdgpu 0000:0a:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[  191.532879] amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[  191.532880] amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[  191.532881] amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  191.532883] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[  191.532884] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[  191.532885] amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[  191.536522] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow start
[  191.555443] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow done
[  191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
[  191.555663] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Yes, I can reliably hang my gfx ring by running Stray with the -dx12
switch. The hang happens in-game, though, not on the title screen.
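
For anyone who wants to put a number on the recovery time, the
timestamps in the excerpt above can be diffed with a few lines of
Python. This is only a sketch: find_timestamp() and the three-line
excerpt are made up for illustration, and the substring matches assume
the amdgpu message wording shown in this particular log.

```python
import re

# Matches the default dmesg prefix, e.g. "[  189.830630] message text".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_timestamp(lines, needle):
    """Return the timestamp in seconds of the first line whose message
    contains `needle`, or None if no line matches.

    Hypothetical helper, not part of any amdgpu tooling.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


# Three lines copied from the log above; a real run would read the
# output of `dmesg` instead.
excerpt = """\
[  189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[  190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[  191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
""".splitlines()

timeout = find_timestamp(excerpt, "timeout")
begin = find_timestamp(excerpt, "GPU reset begin!")
# "succeeded!" (with the exclamation mark) skips the earlier
# "GPU reset succeeded, trying to resume" line in the full log.
done = find_timestamp(excerpt, "succeeded!")

print(f"reset begin -> done:  {done - begin:.3f} s")
print(f"ring timeout -> done: {done - timeout:.3f} s")
```

For this log that works out to roughly 1.46 s from "GPU reset begin!"
to "GPU reset(2) succeeded!", and about 1.72 s counting from the ring
timeout.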


> Alex
>
>>
>> Alex Deucher (19):
>>   drm/amdgpu/mes: add API for legacy queue reset
>>   drm/amdgpu/mes11: add API for legacy queue reset
>>   drm/amdgpu/mes12: add API for legacy queue reset
>>   drm/amdgpu/mes: add API for user queue reset
>>   drm/amdgpu/mes11: add API for user queue reset
>>   drm/amdgpu/mes12: add API for user queue reset
>>   drm/amdgpu: add new ring reset callback
>>   drm/amdgpu: add per ring reset support (v2)
>>   drm/amdgpu/gfx11: add ring reset callbacks
>>   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>>   drm/amdgpu/gfx10: add ring reset callbacks
>>   drm/amdgpu/gfx10: rework reset sequence
>>   drm/amdgpu/gfx9: add ring reset callback
>>   drm/amdgpu/gfx9.4.3: add ring reset callback
>>   drm/amdgpu/gfx12: add ring reset callbacks
>>   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>>   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>>   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>>   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>
>> Jiadong Zhu (13):
>>   drm/amdgpu/gfx11: wait for reset done before remap
>>   drm/amdgpu/gfx10: remap queue after reset successfully
>>   drm/amdgpu/gfx10: wait for reset done before remap
>>   drm/amdgpu/gfx9: remap queue after reset successfully
>>   drm/amdgpu/gfx9: wait for reset done before remap
>>   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>>   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>>   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>>   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>>   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>>   drm/amdgpu/mes: modify mes api for mmio queue reset
>>   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>>   drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>
>> Prike Liang (2):
>>   drm/amdgpu: increase the reset counter for the queue reset
>>   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>>  14 files changed, 930 insertions(+), 32 deletions(-)
>>
>> --
>> 2.45.2
>>
