public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: John Harrison <john.c.harrison@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	<Intel-GFX@Lists.FreeDesktop.Org>,
	Matthew Brost <matthew.brost@intel.com>
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark
Date: Fri, 22 Jul 2022 12:09:38 -0700	[thread overview]
Message-ID: <1934cccf-a596-d6dc-4417-63109b92303f@intel.com> (raw)
In-Reply-To: <e566ff59-68ce-c712-1619-da64c917c70a@linux.intel.com>

On 7/21/2022 02:24, Tvrtko Ursulin wrote:
> On 21/07/2022 01:54, John Harrison wrote:
>> On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>>> On 19/07/2022 01:05, John Harrison wrote:
>>>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>>>>>> From: Matthew Brost <matthew.brost@intel.com>
>>>>>>
>>>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline 
>>>>>> seqno to
>>>>>> seqno in memory on engine PM unpark. If a GT reset occurred these 
>>>>>> values
>>>>>> might not match as a kernel context could be skipped. This bug was
>>>>>> hidden by always switching to a kernel context on park (execlists
>>>>>> requirement).
>>>>>
>>>>> Reset of the kernel context? Under which circumstances does that 
>>>>> happen?
>>>> As per description, the issue is with full GT reset.
>>>>
>>>>>
>>>>> It is unclear if the claim is this to be a general problem or the 
>>>>> assert is only invalid with the GuC. Lack of a CI reported issue 
>>>>> suggests it is not a generic problem?
>>>> Currently it is not an issue because we always switch to the kernel 
>>>> context because that's how execlists works and the entire driver is 
>>>> fundamentally based on execlist operation. When we stop using the 
>>>> kernel context as a (non-functional) barrier when using GuC 
>>>> submission, then you would see an issue without this fix.
>
> Let me pick this point to try again - I am simply asking for a clear 
> description of steps which lead to the problem, instead of, what I 
> find are, generic and hard to penetrate statements of invalid 
> assumptions etc.
>
> I picked this spot because of this: "When we stop using the kernel 
> context as a (non-functional) barrier when using GuC submission, then 
> you would see an issue without this fix."
>
> I point to 363324292710 ("drm/i915/guc: Don't call 
> switch_to_kernel_context with GuC submission"). Hence it is not when 
> but it already happened. Which in my mind then does not compute - I 
> can't grok the explanation which appears to fall over on the first claim.
>
> Or perhaps the bug on is already firing today on every GuC enabled 
> machine in the CI? In which case there is a Fixes: link to be added?
>
> I have asked about, if we have 363324292710, and if this patch is 
> removing the seqno bug on, why it is not removing something more in 
> __engine_unpark, gated on "is guc"? Like ss there a point to 
> sanitizing the context which wasn't lost, because it wasn't used to 
> park the engine with?
>
> Or if the problem can't be hit with execlists (in case reset claim 
> from the commit message misleading), why shouldn't the bug on be 
> changed to contain the !guc condition instead of being remove?
>
> I am simply asking for a clear explanation of the conditions and steps 
> which lead to the bug on incorrectly firing. It doesn't have to be 
> long text or anything like that, just clear so we can close this and 
> move on.
>
> Regards,
>
> Tvrtko
@Matthew Brost, it's your patch, do you recall the details of when it 
was going bang? I vaguely recall something about it being hit in local 
testing pre-merge rather than by CI post-merge.

John.

>
>>>
>>> Issue is with GuC, GuC and full reset, or with full reset regardless 
>>> of the backend?
>> The issue is with code making invalid assumptions. The assumption is 
>> currently not failing because the execlist backend requires the use 
>> of a barrier context for a bunch of operations. The GuC backend does 
>> not require this. In fact, the barrier context does not function as a 
>> barrier when the scheduler is external to i915. Hence the desire to 
>> remove the use of the barrier context from generic i915 operation and 
>> make it only used when in execlist mode. At that point, the invalid 
>> assumption will no longer work and the BUG will fire.
>>
>>>
>>> If issue is only with GuC patch should have drm/i915/guc prefix as 
>>> minimum. But if it actually only becomes a problem when GuC backend 
>>> stops parking with the kernel context when I think the whole unpark 
>>> code should be refactored in a cleaner way than just removing the 
>>> one assert. Otherwise what is the point of leaving everything else 
>>> in there?
>>>
>>> Or if the issue is backend agnostic, *if* full reset happens to hit 
>>> during parking, then it is different. Wouldn't that be a race with 
>>> parking and reset which probably shouldn't happen to start with.
>>>
>> The issue is neither with GuC nor with resets, GT or otherwise. The 
>> issue is with generic i915 code making assumptions about backend 
>> implementations that are only correct for the execlist implementation.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>>>   1 file changed, 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref 
>>>>>> *wf)
>>>>>>                ce->timeline->seqno,
>>>>>> READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>>>                ce->ring->emit);
>>>>>> -        GEM_BUG_ON(ce->timeline->seqno !=
>>>>>> - READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>>>       }
>>>>>>         if (engine->unpark)
>>>>
>>


  reply	other threads:[~2022-07-22 19:09 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-12 23:31 [Intel-gfx] [PATCH 00/12] Random assortment of (mostly) GuC related patches John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark John.C.Harrison
2022-07-18 12:15   ` Tvrtko Ursulin
2022-07-19  0:05     ` John Harrison
2022-07-19  9:42       ` Tvrtko Ursulin
2022-07-21  0:54         ` John Harrison
2022-07-21  9:24           ` Tvrtko Ursulin
2022-07-22 19:09             ` John Harrison [this message]
2022-07-12 23:31 ` [Intel-gfx] [PATCH 02/12] drm/i915/guc: Don't call ring_is_idle in GuC submission John.C.Harrison
2022-07-18 12:26   ` Tvrtko Ursulin
2022-07-19  0:09     ` John Harrison
2022-07-19  9:49       ` Tvrtko Ursulin
2022-07-19 10:14         ` Tvrtko Ursulin
2022-07-12 23:31 ` [Intel-gfx] [PATCH 03/12] drm/i915/guc: Fix issues with live_preempt_cancel John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 04/12] drm/i915/guc: Add GuC <-> kernel time stamp translation information John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 05/12] drm/i915/guc: Record CTB info in error logs John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 06/12] drm/i915/guc: Use streaming loads to speed up dumping the guc log John.C.Harrison
2022-07-22 20:05   ` John Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 07/12] drm/i915/guc: Route semaphores to GuC for Gen12+ John.C.Harrison
2022-07-13  0:51   ` Matthew Brost
2022-07-15 17:21   ` Ceraolo Spurio, Daniele
2022-07-12 23:31 ` [Intel-gfx] [PATCH 08/12] drm/i915/guc: Add selftest for a hung GuC John.C.Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 09/12] drm/i915/selftest: Cope with not having an RCS engine John.C.Harrison
2022-07-13  0:48   ` Matthew Brost
2022-07-12 23:31 ` [Intel-gfx] [PATCH 10/12] drm/i915/guc: Support larger contexts on newer hardware John.C.Harrison
2022-07-18 12:35   ` Tvrtko Ursulin
2022-07-19  0:13     ` John Harrison
2022-07-19  9:56       ` Tvrtko Ursulin
2022-07-22 19:32         ` John Harrison
2022-07-25 11:24           ` Tvrtko Ursulin
2022-07-12 23:31 ` [Intel-gfx] [PATCH 11/12] drm/i915/guc: Don't abort on CTB_UNUSED status John.C.Harrison
2022-07-18 12:36   ` Tvrtko Ursulin
2022-07-19  0:16     ` John Harrison
2022-07-12 23:31 ` [Intel-gfx] [PATCH 12/12] drm/i915/guc: Add a helper for log buffer size John.C.Harrison
2022-07-13  0:46   ` Matthew Brost
2022-07-13  0:31 ` [Intel-gfx] ✗ Fi.CI.BUILD: warning for Random assortment of (mostly) GuC related patches Patchwork
2022-07-13 20:09 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Random assortment of (mostly) GuC related patches (rev2) Patchwork
2022-07-13 20:27 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-07-14  1:41 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1934cccf-a596-d6dc-4417-63109b92303f@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=matthew.brost@intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox