[PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()
@ 2026-06-16  7:35 Lizhi Hou
  2026-06-16  7:49 ` sashiko-bot
  0 siblings, 1 reply; 2+ messages in thread
From: Lizhi Hou @ 2026-06-16  7:35 UTC (permalink / raw)
  To: ogabbay, quic_jhugo, dri-devel, mario.limonciello, shuvampandey1,
	karol.wachowski
  Cc: Lizhi Hou, linux-kernel, max.zhen, sonal.santan

amdxdna_hwctx_sync_debug_bo() invokes the hardware
hwctx_sync_debug_bo() callback while holding xdna->dev_lock.

The callback may call amdxdna_cmd_submit(), which in turn calls
amdxdna_pm_resume_get(). If the device is suspended,
amdxdna_pm_resume_get() may synchronously execute
amdxdna_pm_resume(), which also acquires xdna->dev_lock, resulting
in a deadlock.

Avoid the deadlock by calling amdxdna_pm_resume_get() before holding
xdna->dev_lock.

Fixes: 7ea046838021 ("accel/amdxdna: Support firmware debug buffer")
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
---
V2:
  Fix jumping forward over the guard(mutex) declaration.
 
 drivers/accel/amdxdna/amdxdna_ctx.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/amdxdna/amdxdna_ctx.c b/drivers/accel/amdxdna/amdxdna_ctx.c
index ff6c3e8e5a15..2f3dfb155d1d 100644
--- a/drivers/accel/amdxdna/amdxdna_ctx.c
+++ b/drivers/accel/amdxdna/amdxdna_ctx.c
@@ -386,16 +386,25 @@ int amdxdna_hwctx_sync_debug_bo(struct amdxdna_client *client, u32 debug_bo_hdl)
 	if (!gobj)
 		return -EINVAL;
 
+	ret = amdxdna_pm_resume_get(xdna);
+	if (ret) {
+		XDNA_ERR(xdna, "Resume failed, ret %d", ret);
+		goto put_obj;
+	}
+
 	abo = to_xdna_obj(gobj);
-	guard(mutex)(&xdna->dev_lock);
+	mutex_lock(&xdna->dev_lock);
 	hwctx = xa_load(&client->hwctx_xa, abo->assigned_hwctx);
 	if (!hwctx) {
 		ret = -EINVAL;
-		goto put_obj;
+		goto unlock;
 	}
 
 	ret = xdna->dev_info->ops->hwctx_sync_debug_bo(hwctx, debug_bo_hdl);
 
+unlock:
+	mutex_unlock(&xdna->dev_lock);
+	amdxdna_pm_suspend_put(xdna);
 put_obj:
 	drm_gem_object_put(gobj);
 	return ret;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()
  2026-06-16  7:35 [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo() Lizhi Hou
@ 2026-06-16  7:49 ` sashiko-bot
  0 siblings, 0 replies; 2+ messages in thread
From: sashiko-bot @ 2026-06-16  7:49 UTC (permalink / raw)
  To: Lizhi Hou; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Stack corruption via use-after-free when userspace resets syncobj during a debug BO sync/config.
- [High] PM resume deadlock in `amdxdna_drm_config_hwctx_ioctl()`.
- [High] Deadlock preventing TDR (Timeout Detection and Recovery) from recovering hung hardware.
--

commit 8a4736ca1e30a73a36f619abfba2f12bdcd407ef
Author: Lizhi Hou <lizhi.hou@amd.com>
accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()

This patch fixes a deadlock by moving the PM resume call before acquiring
the device lock in amdxdna_hwctx_sync_debug_bo.

> diff --git a/drivers/accel/amdxdna/amdxdna_ctx.c b/drivers/accel/amdxdna/amdxdna_ctx.c
> index 855da8c79a1cd..b7dbb60bca711 100644
> --- a/drivers/accel/amdxdna/amdxdna_ctx.c
> +++ b/drivers/accel/amdxdna/amdxdna_ctx.c

[ ... ]

> @@ -412,16 +412,25 @@ int amdxdna_hwctx_sync_debug_bo(struct amdxdna_client *client, u32 debug_bo_hdl)
>  	if (!gobj)
>  		return -EINVAL;
>  
> +	ret = amdxdna_pm_resume_get(xdna);
> +	if (ret) {
> +		XDNA_ERR(xdna, "Resume failed, ret %d", ret);
> +		goto put_obj;
> +	}
> +

[Severity: High]
This isn't a bug introduced by this patch, but does the exact same deadlock
pattern still exist in amdxdna_drm_config_hwctx_ioctl()?

When userspace issues DRM_AMDXDNA_HWCTX_ASSIGN_DBG_BUF, the ioctl wrapper
acquires the lock:

amdxdna_drm_config_hwctx_ioctl() {
    ...
    guard(mutex)(&xdna->dev_lock);
    ...
}

It then calls the hwctx_config callback, which traces down to
aie2_hwctx_cfg_debug_bo(), calling amdxdna_cmd_submit().
amdxdna_cmd_submit() unconditionally calls amdxdna_pm_resume_get().

If the device is suspended, this triggers a synchronous PM resume callback,
which attempts to acquire xdna->dev_lock, resulting in the same deadlock.
Unlike the patched code here, the ioctl wrapper doesn't seem to resume
the device before taking the lock.

>  	abo = to_xdna_obj(gobj);
> -	guard(mutex)(&xdna->dev_lock);
> +	mutex_lock(&xdna->dev_lock);

[Severity: High]
This is a pre-existing issue, but holding dev_lock here while calling down
to the hardware backend seems to cause a deadlock with TDR recovery.

Both amdxdna_hwctx_sync_debug_bo() and amdxdna_drm_config_hwctx_ioctl()
hold xdna->dev_lock while invoking backend operations. These operations
submit a job and wait indefinitely:

aie2_cmd_wait() {
    ...
    dma_fence_wait_timeout(out_fence, false, MAX_SCHEDULE_TIMEOUT);
    ...
}

If the hardware hangs, aie2_cmd_wait() blocks. The DRM scheduler timeout
triggers and executes aie2_sched_job_timedout() to reset the hardware.
However, the timeout handler attempts to acquire the lock:

aie2_sched_job_timedout() {
    ...
    guard(mutex)(&xdna->dev_lock);
    ...
}

Since the waiting thread already holds dev_lock and won't release it until
the job completes, does this deadlock the recovery thread, rendering the
system unable to recover from hardware hangs?

>  	hwctx = xa_load(&client->hwctx_xa, abo->assigned_hwctx);
>  	if (!hwctx) {
>  		ret = -EINVAL;
> -		goto put_obj;
> +		goto unlock;
>  	}
>  
>  	ret = xdna->dev_info->ops->hwctx_sync_debug_bo(hwctx, debug_bo_hdl);

[Severity: Critical]
This is a pre-existing issue, but could this synchronous wait lead to stack
corruption if userspace resets the syncobj?

In aie2_hwctx_sync_debug_bo() and aie2_hwctx_cfg_debug_bo(), a command
structure is allocated on the stack:

aie2_hwctx_sync_debug_bo() {
    struct amdxdna_drv_cmd cmd = { 0 };
    ...
}

This is submitted and cached in the DRM scheduler job (job->drv_cmd = &cmd).
The thread then waits using aie2_cmd_wait(), which fetches the out-fence
via drm_syncobj_fence_get(hwctx->priv->syncobj).

Since hwctx->syncobj_hdl is exported to userspace upon context creation,
what happens if userspace erroneously or maliciously calls
DRM_IOCTL_SYNCOBJ_RESET? 

drm_syncobj_fence_get() will return NULL, causing aie2_cmd_wait() to return
early. The caller function then exits, popping the cmd struct from the
stack while the DRM scheduler job is still executing.

When the job completes, the response handler writes the result:

aie2_sched_drvcmd_resp_handler() {
    ...
    job->drv_cmd->result = readl(data);
    ...
}

Would this lead to an arbitrary write-what-where stack corruption?

>  
> +unlock:
> +	mutex_unlock(&xdna->dev_lock);
> +	amdxdna_pm_suspend_put(xdna);
>  put_obj:
>  	drm_gem_object_put(gobj);
>  	return ret;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616073555.3548595-1-lizhi.hou@amd.com?part=1

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2026-06-16  7:49 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-16  7:35 [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo() Lizhi Hou
2026-06-16  7:49 ` sashiko-bot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.