From: Timur Kristóf
To: Alex Deucher, Christian König, Jiqian Chen
Cc: amd-gfx@lists.freedesktop.org, Samuel Pitoiset, Tvrtko Ursulin,
 Huang Rui, Huang Trigger, Jiqian Chen
Subject: Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
Date: Thu, 11 Jun 2026 22:26:48 +0200
Message-ID: <3694190.dWV9SEqChM@timur-hyperion>
In-Reply-To: <20260611055715.1142135-1-Jiqian.Chen@amd.com>
References: <20260611055715.1142135-1-Jiqian.Chen@amd.com>

On Thursday, June 11, 2026 7:57:15 AM Central European Summer Time Jiqian Chen
wrote:
> For Renoir APU with gfx9, in some test scenarios with disabling
> ring_reset, like accessing an unmapped invalid address, it can
> trigger a gpu job timeout event, then driver uses Mode2 reset
> to reset GPU, but after Mode2 compute Ring test and IB test fail
> randomly. It is because the CPC and CPF are still stuck after Mode2,
> that causes compute Ring test fail. What's more, the HQDs of
> MECs are still active, that causes MECs use stale HQDs when MECs
> are unhalted before driver restore MQDs, then causes compute IB
> tests fail.
>
> So, add sequences to reset CPC and CPF after Mode2, and deactivate
> HQDs of MECs before unhalting MECs.
>
> Signed-off-by: Jiqian Chen
> ---
> v1->v2 changes:
> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> * Add reset Mode2 method check to the if condition that call my sequences
>
> v1:
> Hi all,
>
> My board is Renoir APU with gfx9, smu12. I run a testcase that
> accesses an invalid address to trigger an amdgpu_job_timedout()
> with disabling ring_reset, so that driver will call mode2 reset
> directly. After mode2 reset I found compute Ring tests and compute
> IB tests fail randomly on random compute ring.
>
> We checked the scan dump of GPU, we can see the CPC and CPF are
> still stuck, that caused Compute Ring tests fail.
>
> I added printings in driver codes (gfx_v9_0_cp_resume), and found
> the HQDs of MECs are still active, that may cause MECs use stale
> HQDs when MECs are unhalted before mapping compute queues (restoring
> MQDs to HQDs).
>
> So, I send this patch to fix above problems.
> There are two main changes of my patch:
> One is to reset CPC and CPF before resuming KCQ.
> Another is to disable HQDs before unhalting MECs.

Hi,

Indeed I've seen similar issues on other GPUs, as I've been looking into
improving GPU recovery.

Instead of forcing HQD_ACTIVE to zero, I suggest deactivating the HQD
before reset. We should introduce a gfx_v9_0_deactivate_hqd() function
similar to gfx_v8_0_deactivate_hqd(), and call it from somewhere in
gfx_v9_0_hw_fini() when disabling the compute queues.

In fact, it looks like gfx_v9_0_hw_fini() already deactivates the HQD, but
only for the KIQ and only when it isn't in reset or suspend. That looks
wrong to me and I think it should do that for all compute queues (in
addition to the KIQ) either unconditionally or before a mode2 reset.

What do you think?

I don't have a Renoir APU yet but if you need help, I can try to see if I
can reproduce something like this on a Vega 10 dGPU.
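
To make the dequeue-then-poll idea above concrete, here is a minimal userspace C sketch. It is only a mock, not driver code: `struct mock_hqd`, `mock_hqd_at()`, `mock_read_active()`, `mock_deactivate_hqd()`, `mock_deactivate_all_hqds()`, the topology constants and `POLL_LIMIT` are all hypothetical names invented for illustration, and the hardware behaviour is simulated in plain memory. A real gfx_v9_0_deactivate_hqd() would presumably select each queue with soc15_grbm_select() under srbm_mutex and poll the actual registers instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical topology; the driver reads these from adev->gfx.mec. */
#define NUM_MEC            2
#define NUM_PIPE_PER_MEC   4
#define NUM_QUEUE_PER_PIPE 8
#define POLL_LIMIT         100 /* stand-in for a microsecond timeout */

/* Mock per-queue state held in memory; not a real register layout. */
struct mock_hqd {
	bool     active;      /* models the ACTIVE bit of CP_HQD_ACTIVE */
	uint32_t dequeue_req; /* models a pending dequeue request */
	uint32_t drain_polls; /* reads the mock needs before it goes idle */
};

static struct mock_hqd hqds[NUM_MEC][NUM_PIPE_PER_MEC][NUM_QUEUE_PER_PIPE];

/* Bounds-checked lookup, standing in for selecting a queue. */
static struct mock_hqd *mock_hqd_at(int me, int pipe, int queue)
{
	assert(me >= 0 && me < NUM_MEC);
	assert(pipe >= 0 && pipe < NUM_PIPE_PER_MEC);
	assert(queue >= 0 && queue < NUM_QUEUE_PER_PIPE);
	return &hqds[me][pipe][queue];
}

/* Mock register read: while a dequeue is pending, each read drains a bit. */
static bool mock_read_active(struct mock_hqd *q)
{
	if (q->active && q->dequeue_req) {
		if (q->drain_polls == 0)
			q->active = false;
		else
			q->drain_polls--;
	}
	return q->active;
}

/* Request a dequeue, then poll until idle or until the poll limit is hit. */
static int mock_deactivate_hqd(struct mock_hqd *q, uint32_t req)
{
	int i, r = 0;

	if (mock_read_active(q)) {
		q->dequeue_req = req;
		for (i = 0; i < POLL_LIMIT; i++) {
			if (!mock_read_active(q))
				break;
		}
		if (i == POLL_LIMIT)
			r = -ETIMEDOUT;
	}
	q->dequeue_req = 0;
	return r;
}

/* Visit every MEC/pipe/queue; return how many queues failed to drain. */
static int mock_deactivate_all_hqds(void)
{
	int me, pipe, queue, stuck = 0;

	for (me = 0; me < NUM_MEC; me++)
		for (pipe = 0; pipe < NUM_PIPE_PER_MEC; pipe++)
			for (queue = 0; queue < NUM_QUEUE_PER_PIPE; queue++)
				if (mock_deactivate_hqd(mock_hqd_at(me, pipe, queue), 1))
					stuck++;
	return stuck;
}
```

In this mock, a queue marked active with a small `drain_polls` ends up inactive after `mock_deactivate_all_hqds()`, while a queue that never drains is reported as timed out (-ETIMEDOUT) and left active instead of being forced to zero, so the caller can decide how to handle it.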

Best regards,
Timur

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 47721d0c3781..d3ef45aa299a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>  	return amdgpu_gfx_enable_kcq(adev, 0);
>  }
>
> +static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
> +{
> +	u32 tmp;
> +	int i, j, k;
> +
> +	/*
> +	 * CPC and CPF are still stuck after Mode2 reset, that causes later
> +	 * compute ring test fail and then loop Mode2 reset infinitely
> +	 */
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	udelay(50);
> +
> +	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> +		 GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	udelay(50);
> +
> +	/*
> +	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> +	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> +	 * Otherwise, later compute IB test may fail
> +	 */
> +	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> +		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> +			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> +				mutex_lock(&adev->srbm_mutex);
> +				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> +				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> +				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +				mutex_unlock(&adev->srbm_mutex);
> +			}
> +		}
> +	}
> +}
> +
>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  {
>  	int r, i;
> @@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  	gfx_v9_0_cp_gfx_enable(adev, false);
>  	gfx_v9_0_cp_compute_enable(adev, false);
>
> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> +	    amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> +		gfx_v9_0_cp_mode2_clear_state(adev);
> +
>  	r = gfx_v9_0_kiq_resume(adev);
>  	if (r)
>  		return r;