From: John Harrison <john.c.harrison@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
<Intel-GFX@Lists.FreeDesktop.Org>,
Matthew Brost <matthew.brost@intel.com>
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark
Date: Fri, 22 Jul 2022 12:09:38 -0700 [thread overview]
Message-ID: <1934cccf-a596-d6dc-4417-63109b92303f@intel.com> (raw)
In-Reply-To: <e566ff59-68ce-c712-1619-da64c917c70a@linux.intel.com>
On 7/21/2022 02:24, Tvrtko Ursulin wrote:
> On 21/07/2022 01:54, John Harrison wrote:
>> On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>>> On 19/07/2022 01:05, John Harrison wrote:
>>>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>>>>>> From: Matthew Brost <matthew.brost@intel.com>
>>>>>>
>>>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline
>>>>>> seqno to
>>>>>> seqno in memory on engine PM unpark. If a GT reset occurred these
>>>>>> values
>>>>>> might not match as a kernel context could be skipped. This bug was
>>>>>> hidden by always switching to a kernel context on park (execlists
>>>>>> requirement).
>>>>>
>>>>> Reset of the kernel context? Under which circumstances does that
>>>>> happen?
>>>> As per description, the issue is with full GT reset.
>>>>
>>>>>
>>>>> It is unclear if the claim is this to be a general problem or the
>>>>> assert is only invalid with the GuC. Lack of a CI reported issue
>>>>> suggests it is not a generic problem?
>>>> Currently it is not an issue because we always switch to the kernel
>>>> context because that's how execlists works and the entire driver is
>>>> fundamentally based on execlist operation. When we stop using the
>>>> kernel context as a (non-functional) barrier when using GuC
>>>> submission, then you would see an issue without this fix.
>
> Let me pick this point to try again - I am simply asking for a clear
> description of steps which lead to the problem, instead of, what I
> find are, generic and hard to penetrate statements of invalid
> assumptions etc.
>
> I picked this spot because of this: "When we stop using the kernel
> context as a (non-functional) barrier when using GuC submission, then
> you would see an issue without this fix."
>
> I point to 363324292710 ("drm/i915/guc: Don't call
> switch_to_kernel_context with GuC submission"). Hence it is not when
> but it already happened. Which in my mind then does not compute - I
> can't grok the explanation which appears to fall over on the first claim.
>
> Or perhaps the bug on is already firing today on every GuC enabled
> machine in the CI? In which case there is a Fixes: link to be added?
>
> I have asked about, if we have 363324292710, and if this patch is
> removing the seqno bug on, why it is not removing something more in
> __engine_unpark, gated on "is guc"? Like ss there a point to
> sanitizing the context which wasn't lost, because it wasn't used to
> park the engine with?
>
> Or if the problem can't be hit with execlists (in case reset claim
> from the commit message misleading), why shouldn't the bug on be
> changed to contain the !guc condition instead of being remove?
>
> I am simply asking for a clear explanation of the conditions and steps
> which lead to the bug on incorrectly firing. It doesn't have to be
> long text or anything like that, just clear so we can close this and
> move on.
>
> Regards,
>
> Tvrtko
@Matthew Brost, it's your patch, do you recall the details of when it
was going bang? I vaguely recall something about it being hit in local
testing pre-merge rather than by CI post-merge.
John.
>
>>>
>>> Issue is with GuC, GuC and full reset, or with full reset regardless
>>> of the backend?
>> The issue is with code making invalid assumptions. The assumption is
>> currently not failing because the execlist backend requires the use
>> of a barrier context for a bunch of operations. The GuC backend does
>> not require this. In fact, the barrier context does not function as a
>> barrier when the scheduler is external to i915. Hence the desire to
>> remove the use of the barrier context from generic i915 operation and
>> make it only used when in execlist mode. At that point, the invalid
>> assumption will no longer work and the BUG will fire.
>>
>>>
>>> If issue is only with GuC patch should have drm/i915/guc prefix as
>>> minimum. But if it actually only becomes a problem when GuC backend
>>> stops parking with the kernel context when I think the whole unpark
>>> code should be refactored in a cleaner way than just removing the
>>> one assert. Otherwise what is the point of leaving everything else
>>> in there?
>>>
>>> Or if the issue is backend agnostic, *if* full reset happens to hit
>>> during parking, then it is different. Wouldn't that be a race with
>>> parking and reset which probably shouldn't happen to start with.
>>>
>> The issue is neither with GuC nor with resets, GT or otherwise. The
>> issue is with generic i915 code making assumptions about backend
>> implementations that are only correct for the execlist implementation.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>>> 1 file changed, 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref
>>>>>> *wf)
>>>>>> ce->timeline->seqno,
>>>>>> READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>>> ce->ring->emit);
>>>>>> - GEM_BUG_ON(ce->timeline->seqno !=
>>>>>> - READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>>> }
>>>>>> if (engine->unpark)
>>>>
>>
next prev parent reply other threads:[~2022-07-22 19:09 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-12 23:31 [Intel-gfx] [PATCH 00/12] Random assortment of (mostly) GuC related patches John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark John.C.Harrison
2022-07-18 12:15 ` Tvrtko Ursulin
2022-07-19 0:05 ` John Harrison
2022-07-19 9:42 ` Tvrtko Ursulin
2022-07-21 0:54 ` John Harrison
2022-07-21 9:24 ` Tvrtko Ursulin
2022-07-22 19:09 ` John Harrison [this message]
2022-07-12 23:31 ` [Intel-gfx] [PATCH 02/12] drm/i915/guc: Don't call ring_is_idle in GuC submission John.C.Harrison
2022-07-18 12:26 ` Tvrtko Ursulin
2022-07-19 0:09 ` John Harrison
2022-07-19 9:49 ` Tvrtko Ursulin
2022-07-19 10:14 ` Tvrtko Ursulin
2022-07-12 23:31 ` [Intel-gfx] [PATCH 03/12] drm/i915/guc: Fix issues with live_preempt_cancel John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 04/12] drm/i915/guc: Add GuC <-> kernel time stamp translation information John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 05/12] drm/i915/guc: Record CTB info in error logs John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 06/12] drm/i915/guc: Use streaming loads to speed up dumping the guc log John.C.Harrison
2022-07-22 20:05 ` John Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 07/12] drm/i915/guc: Route semaphores to GuC for Gen12+ John.C.Harrison
2022-07-13 0:51 ` Matthew Brost
2022-07-15 17:21 ` Ceraolo Spurio, Daniele
2022-07-12 23:31 ` [Intel-gfx] [PATCH 08/12] drm/i915/guc: Add selftest for a hung GuC John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 09/12] drm/i915/selftest: Cope with not having an RCS engine John.C.Harrison
2022-07-13 0:48 ` Matthew Brost
2022-07-12 23:31 ` [Intel-gfx] [PATCH 10/12] drm/i915/guc: Support larger contexts on newer hardware John.C.Harrison
2022-07-18 12:35 ` Tvrtko Ursulin
2022-07-19 0:13 ` John Harrison
2022-07-19 9:56 ` Tvrtko Ursulin
2022-07-22 19:32 ` John Harrison
2022-07-25 11:24 ` Tvrtko Ursulin
2022-07-12 23:31 ` [Intel-gfx] [PATCH 11/12] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
2022-07-18 12:36 ` Tvrtko Ursulin
2022-07-19 0:16 ` John Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 12/12] drm/i915/guc: Add a helper for log buffer size John.C.Harrison
2022-07-13 0:46 ` Matthew Brost
2022-07-13 0:31 ` [Intel-gfx] ✗ Fi.CI.BUILD: warning for Random assortment of (mostly) GuC related patches Patchwork
2022-07-13 20:09 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Random assortment of (mostly) GuC related patches (rev2) Patchwork
2022-07-13 20:27 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-07-14 1:41 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1934cccf-a596-d6dc-4417-63109b92303f@intel.com \
--to=john.c.harrison@intel.com \
--cc=DRI-Devel@Lists.FreeDesktop.Org \
--cc=Intel-GFX@Lists.FreeDesktop.Org \
--cc=matthew.brost@intel.com \
--cc=tvrtko.ursulin@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox