From: Timur Kristóf
To: Alex Deucher, Christian König, Jiqian Chen
Cc: amd-gfx@lists.freedesktop.org, Samuel Pitoiset, Tvrtko Ursulin, Huang Rui, Huang Trigger, Jiqian Chen
Subject: Re: [PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
Date: Fri, 12 Jun 2026 15:37:33 +0200
Message-ID: <4951358.vXUDI8C0e8@timur-hyperion>
In-Reply-To: <20260612092654.1632603-1-Jiqian.Chen@amd.com>
References: <20260612092654.1632603-1-Jiqian.Chen@amd.com>

On Friday, June 12, 2026 11:26:54 AM Central European Summer Time Jiqian Chen wrote:
> For Renoir APU with gfx9, in some test scenarios with ring_reset
> disabled, such as accessing an unmapped invalid address, a GPU job
> timeout event can be triggered. The driver then uses Mode2 reset to
> reset the GPU, but after Mode2 the compute Ring test and IB test fail
> randomly. This is because the HQDs of the MECs are always active before
> and after Mode2, which causes the MECs to use stale HQDs when they are
> unhalted before the driver restores the MQDs. As a result, CPC and CPF
> are still stuck after Mode2, and the compute Ring and IB tests fail.
>
> So, add sequences to deactivate the HQDs of the MECs in the suspend IP
> function of the resetting process.
>
> v2: Move all sequences into a new function gfx_v9_0_cp_mode2_clear_state
>     (Ray Huang)
>     Check the Mode2 reset method in the if condition (Ray Huang)
> v3: Move all sequences before Mode2 instead of after Mode2 (Timur Kristóf)
>
> Signed-off-by: Jiqian Chen

Looks good, thank you!

Reviewed-by: Timur Kristóf

> ---
> v2->v3 changes:
> * Move all sequences before Mode2 instead of after Mode2, and add a new
>   function gfx_v9_0_deactivate_kcq_hqd to perform the sequence that
>   disables the compute HQDs.
>   Resetting CPC and CPF is then no longer needed, since all sequences
>   have already moved before Mode2 and they are not stuck.
>
> v1->v2 changes:
> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> * Add a Mode2 reset method check to the if condition that calls my sequences
>
> v1:
> Hi all,
>
> My board is a Renoir APU with gfx9 and smu12. I run a test case that
> accesses an invalid address to trigger an amdgpu_job_timedout() with
> ring_reset disabled, so that the driver calls mode2 reset directly.
> After mode2 reset I found that compute Ring tests and compute IB tests
> fail randomly on random compute rings.
>
> We checked the scan dump of the GPU and could see that the CPC and CPF
> are still stuck, which caused the compute Ring tests to fail.
>
> I added prints to the driver code (gfx_v9_0_cp_resume) and found that
> the HQDs of the MECs are still active, which may cause the MECs to use
> stale HQDs when they are unhalted before the compute queues are mapped
> (restoring MQDs to HQDs).
>
> So, I send this patch to fix the above problems.
> There are two main changes in my patch:
> One is to reset CPC and CPF before resuming KCQ.
> Another is to disable HQDs before unhalting MECs.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 37 +++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 90bbddb45730..0c01701488e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -4071,6 +4071,39 @@ static int gfx_v9_0_hw_init(struct amdgpu_ip_block *ip_block)
>  	return r;
>  }
>
> +static void gfx_v9_0_deactivate_kcq_hqd(struct amdgpu_device *adev)
> +{
> +	for (int i = 0; i < adev->gfx.num_compute_rings; i++) {
> +		u32 tmp;
> +		struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
> +
> +		mutex_lock(&adev->srbm_mutex);
> +		soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, 0);
> +		tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
> +		/* disable the queue if it's active */
> +		if (tmp & CP_HQD_ACTIVE__ACTIVE_MASK) {
> +			int j;
> +
> +			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 1);
> +			for (j = 0; j < adev->usec_timeout; j++) {
> +				tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
> +				if (!(tmp & CP_HQD_ACTIVE__ACTIVE_MASK))
> +					break;
> +				udelay(1);
> +			}
> +			if (j == AMDGPU_MAX_USEC_TIMEOUT) {
> +				DRM_DEBUG("comp_%u_%u_%u dequeue request failed.\n",
> +					  ring->me, ring->pipe, ring->queue);
> +				/* Manual disable if dequeue request times out */
> +				WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
> +			}
> +			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 0);
> +		}
> +		soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +		mutex_unlock(&adev->srbm_mutex);
> +	}
> +}
> +
>  static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
>  {
>  	struct amdgpu_device *adev = ip_block->adev;
> @@ -4095,6 +4128,10 @@ static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
>  		return 0;
>  	}
>
> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> +	    amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> +		gfx_v9_0_deactivate_kcq_hqd(adev);
> +
>  	/* Use deinitialize sequence from CAIL when unbinding device from driver,
>  	 * otherwise KIQ is hanging when binding back
>  	 */
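
For anyone following the thread, the dequeue-then-poll sequence in gfx_v9_0_deactivate_kcq_hqd can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not driver code: struct mock_hqd, mock_read_active() and mock_deactivate_hqd() are hypothetical names that do not exist in amdgpu, and the mock fields merely stand in for the CP_HQD_ACTIVE and CP_HQD_DEQUEUE_REQUEST registers. One deliberate difference from the quoted hunk: the sketch compares the loop counter against the same bound the loop runs to, so the manual-disable path is taken whenever the poll expires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ACTIVE_MASK 0x1u

/* Hypothetical stand-in for one hardware queue descriptor (HQD).
 * These names are illustrative only and are not part of amdgpu. */
struct mock_hqd {
	uint32_t active;           /* models the ACTIVE bit of CP_HQD_ACTIVE */
	uint32_t dequeue_request;  /* models CP_HQD_DEQUEUE_REQUEST */
	uint32_t polls_until_idle; /* reads before the queue drains;
	                            * UINT32_MAX means it never drains */
};

/* Models a register read: once a dequeue has been requested, the mock
 * queue goes idle after polls_until_idle further reads. */
static uint32_t mock_read_active(struct mock_hqd *q)
{
	if (q->dequeue_request && (q->active & MOCK_ACTIVE_MASK)) {
		if (q->polls_until_idle == 0)
			q->active = 0;
		else if (q->polls_until_idle != UINT32_MAX)
			q->polls_until_idle--;
	}
	return q->active;
}

/* Request a dequeue, poll up to `timeout` times, and clear the active
 * bit by hand if the queue did not drain in time.
 * Returns true if the queue drained on its own (or was already idle),
 * false if it had to be force-cleared. */
static bool mock_deactivate_hqd(struct mock_hqd *q, uint32_t timeout)
{
	uint32_t j;
	bool drained = true;

	assert(q != NULL);

	if (!(q->active & MOCK_ACTIVE_MASK))
		return true; /* queue is already idle, nothing to do */

	q->dequeue_request = 1;
	for (j = 0; j < timeout; j++) {
		if (!(mock_read_active(q) & MOCK_ACTIVE_MASK))
			break;
		/* a real driver would delay between reads here */
	}
	if (j == timeout) {
		/* poll expired: manual disable, as in the timeout path */
		q->active = 0;
		drained = false;
	}
	q->dequeue_request = 0;
	return drained;
}
```

Calling mock_deactivate_hqd() on a queue whose polls_until_idle is smaller than the timeout returns true, while a queue that never drains returns false and still ends up with its active bit cleared. Either way the queue is inactive when the function returns, which is the state the patch wants before Mode2.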