Re: [PATCH 2/6] drm/msm: Recover HW before retire hung submit

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Akhil P Oommen" <akhilpo@oss.qualcomm.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 2/6] drm/msm: Recover HW before retire hung submit
Date: Thu, 04 Jun 2026 20:27:20 +0000	[thread overview]
Message-ID: <20260604202721.924201F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260605-assorted-fixes-june-v1-2-2caa04f7287c@oss.qualcomm.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Early fence signaling allows concurrent unmapping of shared BOs before the hung GPU is stopped, reproducing the exact pagefaults the patch intended to fix.
- [High] Concurrent GPU progress during hang recovery causes the recovery logic to double-increment the fence and silently drop a subsequent, unexecuted submit.
--

commit 3c515db3794c4f1061da14c3adffcf64bbdd2fc5
Author: Jie Zhang <jie.zhang@oss.qualcomm.com>

drm/msm: Recover HW before retire hung submit

During recovery, it is not safe to retire the hung submit before we
recover the GPU. Retiring the submit triggers BO free and that can
result in GPU pagefaults since the GPU may be actively accessing those
BOs.

To fix this, retire the submits after gpu recovery is complete in
recover_worker().

> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 18ed00e5f143b..9ac7740a87f01 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -552,11 +552,11 @@ static void recover_worker(struct kthread_work *work)
>  		msm_update_fence(ring->fctx, fence);
>  	}

[Severity: High]
This isn't a bug introduced by this patch, but can concurrent GPU progress
during hang recovery cause the recovery logic to double-increment the fence
and silently drop a subsequent, unexecuted submit?

In recover_worker(), the hung submit is identified early using:

    submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);

Because the GPU is not halted during the time it takes to capture the crash
state, it may legitimately finish the hung submit and concurrently write
the next fence value to ring->memptrs->fence.

Later in the function (just above this hunk), the loop reads the updated
fence:

    uint32_t fence = ring->memptrs->fence;
    ...
    if (ring == cur_ring)
        ring->memptrs->fence = ++fence;

This explicitly increments the already-updated fence and writes the next
value back to shared memory. retire_submits() will then retire the
unexecuted submit corresponding to that next fence, skipping it entirely
while userspace receives a success signal.

[Severity: High]
This is a pre-existing issue, but does early fence signaling still allow
concurrent unmapping of shared BOs before the hung GPU is stopped,
reproducing the exact pagefaults this patch intended to fix?

While this patch moves retire_submits() to happen after
gpu->funcs->recover(), the call to msm_update_fence() remains before the
recovery.

Because msm_update_fence() advances the fence context's completed_fence to
include the hung submit, any concurrent thread calling
dma_fence_is_signaled() (such as userspace polling a sync_file) will
observe the updated completed_fence.

This causes dma_fence_is_signaled() to wake up all waiters and allows them
to unmap the shared memory from the GPU page tables while the GPU is still
actively executing.

Should the loop containing msm_update_fence() also be moved after
gpu->funcs->recover() to completely close this window?

>  
> +	gpu->funcs->recover(gpu);
> +
>  	/* retire completed submits, plus the one that hung: */
>  	retire_submits(gpu);
>  
> -	gpu->funcs->recover(gpu);
> -
>  	/*
>  	 * Replay all remaining submits starting with highest priority
>  	 * ring

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605-assorted-fixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=2

next prev parent reply	other threads:[~2026-06-04 20:27 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
2026-06-04 20:21   ` sashiko-bot
2026-06-05 13:09   ` Neil Armstrong
2026-06-06 12:04   ` Dmitry Baryshkov
2026-06-04 20:08 ` [PATCH 2/6] drm/msm: Recover HW before retire hung submit Akhil P Oommen
2026-06-04 20:27   ` sashiko-bot [this message]
2026-06-04 20:08 ` [PATCH 3/6] drm/msm/a6xx: Fix A663 GPUCC register list for state capture Akhil P Oommen
2026-06-06 12:04   ` Dmitry Baryshkov
2026-06-04 20:08 ` [PATCH 4/6] drm/msm/a6xx: Fix A621 " Akhil P Oommen
2026-06-06 12:05   ` Dmitry Baryshkov
2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
2026-06-04 20:25   ` sashiko-bot
2026-06-05  6:50   ` Rob Clark
2026-06-04 20:08 ` [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker Akhil P Oommen
2026-06-04 20:28   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260604202721.924201F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=akhilpo@oss.qualcomm.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.