All of lore.kernel.org
 help / color / mirror / Atom feed
From: sashiko-bot@kernel.org
To: "Lizhi Hou" <lizhi.hou@amd.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()
Date: Tue, 16 Jun 2026 07:49:44 +0000	[thread overview]
Message-ID: <20260616074944.EC7351F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260616073555.3548595-1-lizhi.hou@amd.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Stack corruption via use-after-free when userspace resets syncobj during a debug BO sync/config.
- [High] PM resume deadlock in `amdxdna_drm_config_hwctx_ioctl()`.
- [High] Deadlock preventing TDR (Timeout Detection and Recovery) from recovering hung hardware.
--

commit 8a4736ca1e30a73a36f619abfba2f12bdcd407ef
Author: Lizhi Hou <lizhi.hou@amd.com>
accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo()

This patch fixes a deadlock by moving the PM resume call before acquiring
the device lock in amdxdna_hwctx_sync_debug_bo.

> diff --git a/drivers/accel/amdxdna/amdxdna_ctx.c b/drivers/accel/amdxdna/amdxdna_ctx.c
> index 855da8c79a1cd..b7dbb60bca711 100644
> --- a/drivers/accel/amdxdna/amdxdna_ctx.c
> +++ b/drivers/accel/amdxdna/amdxdna_ctx.c

[ ... ]

> @@ -412,16 +412,25 @@ int amdxdna_hwctx_sync_debug_bo(struct amdxdna_client *client, u32 debug_bo_hdl)
>  	if (!gobj)
>  		return -EINVAL;
>  
> +	ret = amdxdna_pm_resume_get(xdna);
> +	if (ret) {
> +		XDNA_ERR(xdna, "Resume failed, ret %d", ret);
> +		goto put_obj;
> +	}
> +

[Severity: High]
This isn't a bug introduced by this patch, but does the exact same deadlock
pattern still exist in amdxdna_drm_config_hwctx_ioctl()?

When userspace issues DRM_AMDXDNA_HWCTX_ASSIGN_DBG_BUF, the ioctl wrapper
acquires the lock:

amdxdna_drm_config_hwctx_ioctl() {
    ...
    guard(mutex)(&xdna->dev_lock);
    ...
}

It then calls the hwctx_config callback, which traces down to
aie2_hwctx_cfg_debug_bo(), calling amdxdna_cmd_submit().
amdxdna_cmd_submit() unconditionally calls amdxdna_pm_resume_get().

If the device is suspended, this triggers a synchronous PM resume callback,
which attempts to acquire xdna->dev_lock, resulting in the same deadlock.
Unlike the patched code here, the ioctl wrapper doesn't seem to resume
the device before taking the lock.

>  	abo = to_xdna_obj(gobj);
> -	guard(mutex)(&xdna->dev_lock);
> +	mutex_lock(&xdna->dev_lock);

[Severity: High]
This is a pre-existing issue, but holding dev_lock here while calling down
to the hardware backend seems to cause a deadlock with TDR recovery.

Both amdxdna_hwctx_sync_debug_bo() and amdxdna_drm_config_hwctx_ioctl()
hold xdna->dev_lock while invoking backend operations. These operations
submit a job and wait indefinitely:

aie2_cmd_wait() {
    ...
    dma_fence_wait_timeout(out_fence, false, MAX_SCHEDULE_TIMEOUT);
    ...
}

If the hardware hangs, aie2_cmd_wait() blocks. The DRM scheduler timeout
triggers and executes aie2_sched_job_timedout() to reset the hardware.
However, the timeout handler attempts to acquire the lock:

aie2_sched_job_timedout() {
    ...
    guard(mutex)(&xdna->dev_lock);
    ...
}

Since the waiting thread already holds dev_lock and won't release it until
the job completes, does this deadlock the recovery thread, rendering the
system unable to recover from hardware hangs?

>  	hwctx = xa_load(&client->hwctx_xa, abo->assigned_hwctx);
>  	if (!hwctx) {
>  		ret = -EINVAL;
> -		goto put_obj;
> +		goto unlock;
>  	}
>  
>  	ret = xdna->dev_info->ops->hwctx_sync_debug_bo(hwctx, debug_bo_hdl);

[Severity: Critical]
This is a pre-existing issue, but could this synchronous wait lead to stack
corruption if userspace resets the syncobj?

In aie2_hwctx_sync_debug_bo() and aie2_hwctx_cfg_debug_bo(), a command
structure is allocated on the stack:

aie2_hwctx_sync_debug_bo() {
    struct amdxdna_drv_cmd cmd = { 0 };
    ...
}

This is submitted and cached in the DRM scheduler job (job->drv_cmd = &cmd).
The thread then waits using aie2_cmd_wait(), which fetches the out-fence
via drm_syncobj_fence_get(hwctx->priv->syncobj).

Since hwctx->syncobj_hdl is exported to userspace upon context creation,
what happens if userspace erroneously or maliciously calls
DRM_IOCTL_SYNCOBJ_RESET? 

drm_syncobj_fence_get() will return NULL, causing aie2_cmd_wait() to return
early. The caller function then exits, popping the cmd struct from the
stack while the DRM scheduler job is still executing.

When the job completes, the response handler writes the result:

aie2_sched_drvcmd_resp_handler() {
    ...
    job->drv_cmd->result = readl(data);
    ...
}

Would this lead to an arbitrary write-what-where stack corruption?

>  
> +unlock:
> +	mutex_unlock(&xdna->dev_lock);
> +	amdxdna_pm_suspend_put(xdna);
>  put_obj:
>  	drm_gem_object_put(gobj);
>  	return ret;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616073555.3548595-1-lizhi.hou@amd.com?part=1

      reply	other threads:[~2026-06-16  7:49 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-16  7:35 [PATCH V2] accel/amdxdna: Prevent PM resume deadlock in hwctx_sync_debug_bo() Lizhi Hou
2026-06-16  7:49 ` sashiko-bot [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260616074944.EC7351F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=lizhi.hou@amd.com \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.