From: Christopher Snowhill
To: Alex Deucher
Subject: Re: [PATCH 00/34] GC per queue reset
References: <20240718140733.1731004-1-alexander.deucher@amd.com>
Date: Tue, 23 Jul 2024 01:50:18 -0700
Message-ID: <87a5i8jxv9.fsf@gmail.com>

Alex Deucher writes:

> On Thu, Jul 18, 2024 at 10:15 AM Alex Deucher wrote:
>>
>> This adds preliminary support for GC per queue reset. In this
>> case, only the jobs currently in the queue are lost. If this
>> fails, we fall back to a full adapter reset.
>
> Also available here via git:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Just tested this, after encountering the double-add crash trying to
reset after a GPU hang. It doesn't seem to gracefully recover from this
particular GPU hang, but at least now it resets properly.

Still not going to attempt to run it against KDE / Plasma 6.1.3 on Arch,
as that loves to hang if there's any Xwayland involved in the GPU reset
event. However, under labwc-git with my own PR applied to it, it
recovers okay, though Xwayland eventually crashes and is restarted by
labwc.

Here's a dmesg log excerpt of the reset and recovery event:

[ 189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[ 189.830642] amdgpu 0000:0a:00.0: amdgpu: Process information: process Stray-Win64-Shi pid 11560 thread vkd3d_queue pid 11719
[ 190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[ 190.457702] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
[ 190.459418] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
[ 190.459420] amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
[ 190.459423] amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
[ 190.459483] amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
[ 190.967464] amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 190.967824] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 190.967912] [drm] VRAM is lost due to GPU reset!
[ 190.967914] amdgpu 0000:0a:00.0: amdgpu: PSP is resuming...
[ 191.042264] amdgpu 0000:0a:00.0: amdgpu: reserve 0xa00000 from 0x82fd000000 for PSP TMR
[ 191.143003] amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 191.156566] amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 191.156572] amdgpu 0000:0a:00.0: amdgpu: SMU is resuming...
[ 191.156576] amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413e00 (65.62.0)
[ 191.156580] amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
[ 191.156609] amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
[ 191.215750] amdgpu 0000:0a:00.0: amdgpu: SMU is resumed successfully!
[ 191.217023] [drm] DMUB hardware initialized: version=0x02020020
[ 191.530005] [drm] kiq ring mec 2 pipe 1 q 0
[ 191.532863] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 191.532866] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 191.532867] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 191.532869] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 191.532870] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 191.532871] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 191.532872] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 191.532874] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 191.532875] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 191.532876] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 191.532878] amdgpu 0000:0a:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[ 191.532879] amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[ 191.532880] amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[ 191.532881] amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 191.532883] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 191.532884] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 191.532885] amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 191.536522] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow start
[ 191.555443] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow done
[ 191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
[ 191.555663] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Yes, I can reliably hang my gfx ring if I run Stray with -dx12 switch applied.
In-game, though, not on the title screen.

> Alex
>
>>
>> Alex Deucher (19):
>>   drm/amdgpu/mes: add API for legacy queue reset
>>   drm/amdgpu/mes11: add API for legacy queue reset
>>   drm/amdgpu/mes12: add API for legacy queue reset
>>   drm/amdgpu/mes: add API for user queue reset
>>   drm/amdgpu/mes11: add API for user queue reset
>>   drm/amdgpu/mes12: add API for user queue reset
>>   drm/amdgpu: add new ring reset callback
>>   drm/amdgpu: add per ring reset support (v2)
>>   drm/amdgpu/gfx11: add ring reset callbacks
>>   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>>   drm/amdgpu/gfx10: add ring reset callbacks
>>   drm/amdgpu/gfx10: rework reset sequence
>>   drm/amdgpu/gfx9: add ring reset callback
>>   drm/amdgpu/gfx9.4.3: add ring reset callback
>>   drm/amdgpu/gfx12: add ring reset callbacks
>>   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>>   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>>   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>>   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>
>> Jiadong Zhu (13):
>>   drm/amdgpu/gfx11: wait for reset done before remap
>>   drm/amdgpu/gfx10: remap queue after reset successfully
>>   drm/amdgpu/gfx10: wait for reset done before remap
>>   drm/amdgpu/gfx9: remap queue after reset successfully
>>   drm/amdgpu/gfx9: wait for reset done before remap
>>   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>>   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>>   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>>   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>>   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>>   drm/amdgpu/mes: modify mes api for mmio queue reset
>>   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>>   drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>
>> Prike Liang (2):
>>   drm/amdgpu: increase the reset counter for the queue reset
>>   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>>  14 files changed, 930 insertions(+), 32 deletions(-)
>>
>> --
>> 2.45.2
>>