From: sashiko-bot@kernel.org
Subject: Re: [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()
To: "Lizhi Hou"
Cc: dri-devel@lists.freedesktop.org
In-Reply-To: <20260616073555.3548595-1-lizhi.hou@amd.com>
References: <20260616073555.3548595-1-lizhi.hou@amd.com>
Date: Tue, 16 Jun 2026 07:49:44 +0000
Message-Id: <20260616074944.EC7351F000E9@smtp.kernel.org>
Reply-To: sashiko-reviews@lists.linux.dev

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Stack corruption via use-after-free when userspace resets syncobj during a debug BO sync/config.
- [High] PM resume deadlock in `amdxdna_drm_config_hwctx_ioctl()`.
- [High] Deadlock preventing TDR (Timeout Detection and Recovery) from recovering hung hardware.
--

commit 8a4736ca1e30a73a36f619abfba2f12bdcd407ef
Author: Lizhi Hou

accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()

This patch fixes a deadlock by moving the PM resume call before acquiring
the device lock in amdxdna_hwctx_sync_debug_bo.

> diff --git a/drivers/accel/amdxdna/amdxdna_ctx.c b/drivers/accel/amdxdna/amdxdna_ctx.c
> index 855da8c79a1cd..b7dbb60bca711 100644
> --- a/drivers/accel/amdxdna/amdxdna_ctx.c
> +++ b/drivers/accel/amdxdna/amdxdna_ctx.c

[ ... ]

> @@ -412,16 +412,25 @@ int amdxdna_hwctx_sync_debug_bo(struct amdxdna_client *client, u32 debug_bo_hdl)
>  	if (!gobj)
>  		return -EINVAL;
>  
> +	ret = amdxdna_pm_resume_get(xdna);
> +	if (ret) {
> +		XDNA_ERR(xdna, "Resume failed, ret %d", ret);
> +		goto put_obj;
> +	}
> +

[Severity: High]
This isn't a bug introduced by this patch, but does the exact same deadlock
pattern still exist in amdxdna_drm_config_hwctx_ioctl()?

When userspace issues DRM_AMDXDNA_HWCTX_ASSIGN_DBG_BUF, the ioctl wrapper
acquires the lock:

amdxdna_drm_config_hwctx_ioctl() {
    ...
    guard(mutex)(&xdna->dev_lock);
    ...
}

It then calls the hwctx_config callback, which traces down to
aie2_hwctx_cfg_debug_bo(), calling amdxdna_cmd_submit().

amdxdna_cmd_submit() unconditionally calls amdxdna_pm_resume_get(). If the
device is suspended, this triggers a synchronous PM resume callback, which
attempts to acquire xdna->dev_lock, resulting in the same deadlock.

Unlike the patched code here, the ioctl wrapper doesn't seem to resume the
device before taking the lock.

>  	abo = to_xdna_obj(gobj);
> -	guard(mutex)(&xdna->dev_lock);
> +	mutex_lock(&xdna->dev_lock);

[Severity: High]
This is a pre-existing issue, but holding dev_lock here while calling down
to the hardware backend seems to cause a deadlock with TDR recovery.

Both amdxdna_hwctx_sync_debug_bo() and amdxdna_drm_config_hwctx_ioctl()
hold xdna->dev_lock while invoking backend operations. These operations
submit a job and wait indefinitely:

aie2_cmd_wait() {
    ...
    dma_fence_wait_timeout(out_fence, false, MAX_SCHEDULE_TIMEOUT);
    ...
}

If the hardware hangs, aie2_cmd_wait() blocks. The DRM scheduler timeout
triggers and executes aie2_sched_job_timedout() to reset the hardware.
However, the timeout handler attempts to acquire the lock:

aie2_sched_job_timedout() {
    ...
    guard(mutex)(&xdna->dev_lock);
    ...
}

Since the waiting thread already holds dev_lock and won't release it until
the job completes, does this deadlock the recovery thread, rendering the
system unable to recover from hardware hangs?

>  	hwctx = xa_load(&client->hwctx_xa, abo->assigned_hwctx);
>  	if (!hwctx) {
>  		ret = -EINVAL;
> -		goto put_obj;
> +		goto unlock;
>  	}
>  
>  	ret = xdna->dev_info->ops->hwctx_sync_debug_bo(hwctx, debug_bo_hdl);

[Severity: Critical]
This is a pre-existing issue, but could this synchronous wait lead to stack
corruption if userspace resets the syncobj?

In aie2_hwctx_sync_debug_bo() and aie2_hwctx_cfg_debug_bo(), a command
structure is allocated on the stack:

aie2_hwctx_sync_debug_bo() {
    struct amdxdna_drv_cmd cmd = { 0 };
    ...
}

This is submitted and cached in the DRM scheduler job (job->drv_cmd = &cmd).
The thread then waits using aie2_cmd_wait(), which fetches the out-fence via
drm_syncobj_fence_get(hwctx->priv->syncobj).

Since hwctx->syncobj_hdl is exported to userspace upon context creation, what
happens if userspace erroneously or maliciously calls DRM_IOCTL_SYNCOBJ_RESET?

drm_syncobj_fence_get() will return NULL, causing aie2_cmd_wait() to return
early. The caller function then exits, popping the cmd struct from the stack
while the DRM scheduler job is still executing.

When the job completes, the response handler writes the result:

aie2_sched_drvcmd_resp_handler() {
    ...
    job->drv_cmd->result = readl(data);
    ...
}

Would this lead to an arbitrary write-what-where stack corruption?

>  
> +unlock:
> +	mutex_unlock(&xdna->dev_lock);
> +	amdxdna_pm_suspend_put(xdna);
>  put_obj:
>  	drm_gem_object_put(gobj);
>  	return ret;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616073555.3548595-1-lizhi.hou@amd.com?part=1
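
For illustration only: one possible way to decouple the command's lifetime
from the waiter's stack frame is to allocate it on the heap and
reference-count it, so that an early return from the wait cannot invalidate
the pointer cached in the job. The userspace sketch below is a model, not
driver code. Every name in it (struct drv_cmd, struct job, cmd_alloc(),
cmd_get(), cmd_put(), job_submit(), job_complete(),
submit_and_return_early()) is hypothetical and does not exist in amdxdna.
It is single-threaded and uses a plain int counter; a real fix would
presumably use kref or similar and would have to handle concurrency.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver command; not the real amdxdna type. */
struct drv_cmd {
	int refcount;		/* plain int: this model is single-threaded */
	uint32_t result;
};

/* Hypothetical job that caches a pointer to the command. */
struct job {
	struct drv_cmd *drv_cmd;
};

static int live_cmds;		/* commands allocated and not yet freed */

static struct drv_cmd *cmd_alloc(void)
{
	struct drv_cmd *cmd = calloc(1, sizeof(*cmd));

	if (!cmd)
		return NULL;
	cmd->refcount = 1;	/* reference owned by the submitting thread */
	live_cmds++;
	return cmd;
}

static struct drv_cmd *cmd_get(struct drv_cmd *cmd)
{
	cmd->refcount++;
	return cmd;
}

static void cmd_put(struct drv_cmd *cmd)
{
	assert(cmd->refcount > 0);
	if (--cmd->refcount == 0) {
		free(cmd);
		live_cmds--;
	}
}

/* The job takes its own reference instead of borrowing a stack address. */
static void job_submit(struct job *job, struct drv_cmd *cmd)
{
	job->drv_cmd = cmd_get(cmd);
}

/* Models the response handler running after the submitter has returned. */
static uint32_t job_complete(struct job *job, uint32_t hw_result)
{
	uint32_t seen;

	job->drv_cmd->result = hw_result;	/* memory is still valid here */
	seen = job->drv_cmd->result;
	cmd_put(job->drv_cmd);			/* drop the job's reference */
	job->drv_cmd = NULL;
	return seen;
}

/* Models a submitter whose wait returns early, e.g. after a NULL fence. */
static int submit_and_return_early(struct job *job)
{
	struct drv_cmd *cmd = cmd_alloc();

	if (!cmd)
		return -1;
	job_submit(job, cmd);
	cmd_put(cmd);	/* submitter leaves; the job's reference keeps cmd alive */
	return 0;
}
```

In this model, calling submit_and_return_early() and then job_complete()
writes the result into heap memory that is freed only when the last
reference is dropped, rather than into a stack frame that has already been
popped.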