Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
To: Tvrtko Ursulin <tursulin@ursulin.net>,
	Sebastian Brzezinka <sebastian.brzezinka@intel.com>,
	<intel-gfx@lists.freedesktop.org>
Cc: <andi.shyti@linux.intel.com>, <krzysztof.karas@intel.com>,
	<stable@vger.kernel.org>
Subject: Re: [PATCH] drm/i915: skip __i915_request_skip() for already signaled requests
Date: Wed, 13 May 2026 13:38:47 +0200	[thread overview]
Message-ID: <DIHIPV1FJITR.1FJEZMGRDSR7I@intel.com> (raw)
In-Reply-To: <c9c1270b-f724-45dd-a66d-f7b30f6c6087@ursulin.net>

Hi,

On Wed May 13, 2026 at 10:47 AM CEST, Tvrtko Ursulin wrote:
>
> On 16/04/2026 12:31, Sebastian Brzezinka wrote:
>> After a GPU reset the HWSP is zeroed, so previously completed
>> requests appear incomplete. If such a request is picked up during
>> reset_rewind() and marked guilty, i915_request_set_error_once()
>> returns early (fence already signaled), leaving fence.error without
>> a fatal error code. The subsequent __i915_request_skip() then hits:
>> ```
>> GEM_BUG_ON(!fatal_error(rq->fence.error))
>> ```
>> 
>> Fixes a kernel BUG observed on Sandy Bridge (Gen6) during
>> heartbeat-triggered engine resets.
>> ```
>> kernel BUG at drivers/gpu/drm/i915/i915_request.c:556!
>> RIP: __i915_request_skip+0x15e/0x1d0 [i915]
>> ...
>> __i915_request_reset+0x212/0xa70 [i915]
>> reset_rewind+0xe4/0x280 [i915]
>> intel_gt_reset+0x30d/0x5b0 [i915]
>> heartbeat+0x516/0x530 [i915]
>> ```
>> 
>> Guard __i915_request_skip() with i915_request_signaled(), if the
>> fence is already signaled, the ring content is committed and there
>> is nothing left to skip.
>> 
>> Cc: stable@vger.kernel.org
>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/13729
>> Fixes: 36e191f0644b ("drm/i915: Apply i915_request_skip() on submission")
>> Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index 37272871b0f2..b728a5171e93 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -133,7 +133,8 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
>>   	rcu_read_lock(); /* protect the GEM context */
>>   	if (guilty) {
>>   		i915_request_set_error_once(rq, -EIO);
>> -		__i915_request_skip(rq);
>> +		if (!i915_request_signaled(rq))
>> +			__i915_request_skip(rq);
>
> I spotted this patch in drm-intel-fixes today so some questions.
>
> If the request is okay why is setting error and marking it guilty left?
The request can still be guilty even if it already looks signaled
after reset. The important point is that i915_request_set_error_once()
will return early once the request is already signaled, so it may not
actually inject the error. That leaves __i915_request_skip() with no
error to work with, which is why the guard is needed.

>
> 1)
> How confident are you of the Fixes: target? That patch is six years old 
> but the Closes: issue is only from last year? Do internal Intel log have 
> evidence bug was there in between those two dates? How sporadic was it? 
a
I’m quite confident the fix is correct, and it should not break anything
if it is missing some corner detail. This is an extremely rare issue,
specific to Sandy Bridge, but we know it was present at least as far
back as 2022 from work item 5774. The bug only shows up when the right
reset timing lines up, which is why it is so sporadic.

> Were you able to verify the fix easily or with difficulty and how?
I verified the fix mainly by code analysis. In the worst case, it should
not break anything because the change only skips __i915_request_skip()
when the request is already signaled, and  the ring content is already
committed, so there is nothing left to skip.

>
> 2)
> Is the issue only that the order of setting the error code and the bug 
> on got swapped?
>
> Ie. before 36e191f0644b
>
> __i915_request_reset
>   -> i915_request_skip
>         GEM_BUG_ON(!IS_ERR_VALUE((long)error));
>         dma_fence_set_error(&rq->fence, error);
>
> After:
>
> __i915_request_reset
>   i915_request_set_error_once
>   -> i915_request_skip
>
Yes, exactly. In the old code, i915_request_skip(rq, error) always set
the fence error first, before doing anything else:
```
GEM_BUG_ON(!IS_ERR_VALUE((long)error));
dma_fence_set_error(&rq->fence, error);
```

So even if the request was already signaled, the error was still
recorded. That is the important difference.  After 36e191f0644b, that
logic was split into two steps:
```
i915_request_set_error_once(rq, -EIO);
__i915_request_skip(rq);
```

Now i915_request_set_error_once() can return early when the request is
already signaled, which means the error may never get set at all. That
is why the new guard is needed around __i915_request_skip(). The old
code did not have this problem because the error was set unconditionally
inside i915_request_skip().

> If that is the case commit message should have been clearer on both 
> questions.
could you tell me how I should send the corrected commit message? Is it
enough to send it here, or should I send a new version to the mailing
list?

-- 
Best regards,
Sebastian


  reply	other threads:[~2026-05-13 11:39 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-16 11:31 [PATCH] drm/i915: skip __i915_request_skip() for already signaled requests Sebastian Brzezinka
2026-04-16 13:19 ` ✓ i915.CI.BAT: success for " Patchwork
2026-04-16 23:10 ` ✗ i915.CI.Full: failure " Patchwork
2026-04-20  9:18 ` [PATCH] " Krzysztof Karas
2026-04-28 15:10   ` Andi Shyti
2026-04-20 18:04 ` ✓ i915.CI.BAT: success for drm/i915: skip __i915_request_skip() for already signaled requests (rev2) Patchwork
2026-04-20 20:08 ` ✓ i915.CI.Full: " Patchwork
2026-05-08 12:40 ` [PATCH] drm/i915: skip __i915_request_skip() for already signaled requests Andi Shyti
2026-05-13  8:47 ` Tvrtko Ursulin
2026-05-13 11:38   ` Sebastian Brzezinka [this message]
2026-05-13 15:11     ` Tvrtko Ursulin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DIHIPV1FJITR.1FJEZMGRDSR7I@intel.com \
    --to=sebastian.brzezinka@intel.com \
    --cc=andi.shyti@linux.intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=krzysztof.karas@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=tursulin@ursulin.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox