From: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>,
Andi Shyti <andi.shyti@linux.intel.com>
Cc: "intel-gfx@lists.freedesktop.org"
<intel-gfx@lists.freedesktop.org>,
"stable@vger.kernel.org" <stable@vger.kernel.org>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"chris@chris-wilson.co.uk" <chris@chris-wilson.co.uk>
Subject: Re: [Intel-gfx] [PATCH] drm/i915/gt: Reset twice
Date: Thu, 22 Dec 2022 11:28:34 +0200 [thread overview]
Message-ID: <4a3f841d-e0ab-dfd9-a6c0-08e2e04c7b6f@intel.com> (raw)
In-Reply-To: <Y5t+gEMl/XFpAh4N@intel.com>
On 12/15/22 10:07 PM, Rodrigo Vivi wrote:
> On Wed, Dec 14, 2022 at 11:37:19PM +0100, Andi Shyti wrote:
>> Hi Rodrigo,
>>
>> On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
>>> On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
>>>> Hi Rodrigo,
>>>>
>>>> On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
>>>>> On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
>>>>>> From: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>
>>>>>> After applying an engine reset, on some platforms like
>>>>>> Jasperlake, we
>>>>>> occasionally detect that the engine state is not cleared until
>>>>>> shortly
>>>>>> after the resume. As we try to resume the engine with volatile
>>>>>> internal
>>>>>> state, the first request fails with a spurious CS event (it looks
>>>>>> like
>>>>>> it reports a lite-restore to the hung context, instead of the
>>>>>> expected
>>>>>> idle->active context switch).
>>>>>>
>>>>>> Signed-off-by: Chris Wilson <hris@chris-wilson.co.uk>
>>>>>
>>>>> There's a typo in the signature email I'm afraid...
>>>>
>>>> oh yes, I forgot the 'C' :)
>>>
>>> you forgot?
>>> who signed it off?
>>
>> Chris, but as I was copy/pasting SoB's I might have
>> unintentionally removed the 'c'.
>>
>>>>> Other than that, have we checked the possibility of using the
>>>>> driver-initiated-flr bit
>>>>> instead of this second loop? That should be the right way to
>>>>> guarantee everything is
>>>>> cleared on gen11+...
>>>>
>>>> maybe I am misinterpreting it, but is FLR the same as resetting
>>>> hardware domains individually?
>>>
>>> No, it is bigger than that... almost the PCI FLR with some exceptions:
>>>
>>> https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html
>>
>> yes, exactly... I would use FLR feedback if I was performing an
>> FLR reset. But here I'm not doing that, here I'm simply gating
>> off some power domains. It happens that those power domains turn
>> on and off engines making them reset.
>
> is this issue only seeing when this reset is called from the
> sanitize functions at probe and resumes?
> Or from any kind of gt reset?
>
> I don't remember seeing any reference link to the bug in the patch,
> hence I'm assuming this is happening in any kind of gt reset that
> ends up in this function.
>
>>
>> FLR doesn't have anything to do here, also because if you want to
>> reset a single engine you go through this function, instead of
>> resetting the whole GPU with whatever is annexed.
>
> yeap. That might be to extreme depending on the case. But all that
> I asked was if we were considering this option since this is the
> recommended way of reseting our engines nowadays.
>
>>
>> This patch is not fixing the "reset" concept of i915, but it's
>> fixing a missing feedback that happens in one single platform
>> when trying to gate on/off a domain.
>
> But it is changing the reset concept and timeouts for all the reset
> cases in all the platforms.
>
>>
>> Maybe I am completely off track, but I don't see connection with
>> FLR here.
>
> The point is that if a reset is needed, for any reason,
> the recommended way for Jasperlake, and any other newer platforms,
> is to use the FLR rather than the engine reset. But we are using
> the engine reset, and now twice, rather then attempt the recommended
> way.
>
>>
>> (besides FLR might not be present in all the platforms)
>
> This issue is also not present in all the platforms and you are still
> increasing the loops and delay for all the platforms.
>
>>
>> Thanks a lot for your inputs,
>
> have we looked to the Jasperlake workarounds to see if we are missing
> anything there that could help us in this case instead of this extreme
> approach of randomly increasing timeouts and attempts for all the
> platforms?
>
Hi, Rodrigo
JSL_WA (Bspec: 33451) doesn't seem to have a WA similar to this issue.
(Please correct me if I can't find it.)
>> Andi
>>
>>>> How am I supposed to use driver_initiated_flr() in this context?
>>>
>>> Some drivers are not even using this gt reset anymore and going
>>> directly with the driver-initiated FLR in case that guc reset failed.
>>>
>>> I believe we can still keep the gt reset in our case as we currently
>>> have, but instead of keep retrying it until it succeeds we probably
>>> should go to the next level and do the driver initiated FLR when the GT
>>> reset failed.
>>>
>>>>
>>>> Thanks,
>>>> Andi
>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>>>> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/i915/gt/intel_reset.c | 34
>>>>>> ++++++++++++++++++++++-----
>>>>>> 1 file changed, 28 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> index ffde89c5835a4..88dfc0c5316ff 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
>>>>>> intel_engine_mask_t engine_mask,
>>>>>> static int gen6_hw_domain_reset(struct intel_gt *gt, u32
>>>>>> hw_domain_mask)
>>>>>> {
>>>>>> struct intel_uncore *uncore = gt->uncore;
>>>>>> + int loops = 2;
>>>>>> int err;
>>>>>>
>>>>>> /*
>>>>>> @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
>>>>>> intel_gt *gt, u32 hw_domain_mask)
>>>>>> * for fifo space for the write or forcewake the chip for
>>>>>> * the read
>>>>>> */
>>>>>> - intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>> + do {
>>>>>> + intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>>
>>>>>> - /* Wait for the device to ack the reset requests */
>>>>>> - err = __intel_wait_for_register_fw(uncore,
>>>>>> - GEN6_GDRST,
>>>>>> hw_domain_mask, 0,
>>>>>> - 500, 0,
>>>>>> - NULL);
>>>>>> + /*
>>>>>> + * Wait for the device to ack the reset requests.
>>>>>> + *
>>>>>> + * On some platforms, e.g. Jasperlake, we see see
>>>>>> that the
>>>>>> + * engine register state is not cleared until
>>>>>> shortly after
>>>>>> + * GDRST reports completion, causing a failure as
>>>>>> we try
>>>>>> + * to immediately resume while the internal state
>>>>>> is still
>>>>>> + * in flux. If we immediately repeat the reset,
>>>>>> the second
>>>>>> + * reset appears to serialise with the first, and
>>>>>> since
>>>>>> + * it is a no-op, the registers should retain
>>>>>> their reset
>>>>>> + * value. However, there is still a concern that
>>>>>> upon
>>>>>> + * leaving the second reset, the internal engine
>>>>>> state
>>>>>> + * is still in flux and not ready for resuming.
>>>>>> + */
>>>>>> + err = __intel_wait_for_register_fw(uncore,
>>>>>> GEN6_GDRST,
>>>>>> +
>>>>>> hw_domain_mask, 0,
>>>>>> + 2000, 0,
>>>>>> + NULL);
Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it
tries to reset it once more. How was this value of 2000 calculated?
>>>>>> + } while (err == 0 && --loops);
>>>>>> if (err)
>>>>>> GT_TRACE(gt,
>>>>>> "Wait for 0x%08x engines reset
>>>>>> failed\n",
>>>>>> hw_domain_mask);
Did GT_TRACE report an error in a situation where the problem was reported?
>>>>>>
>>>>>> + /*
>>>>>> + * As we have observed that the engine state is still
>>>>>> volatile
>>>>>> + * after GDRST is acked, impose a small delay to let
>>>>>> everything settle.
>>>>>> + */
>>>>>> + udelay(50);
udelay(50) affects all platforms that can call gen6_hw_domain_reset(),
is that intended?
Br,
G.G.
>>>>>> +
>>>>>> return err;
>>>>>> }
>>>>>>
>>>>>> --
>>>>>> 2.38.1
>>>>>>
>>>
next prev parent reply other threads:[~2022-12-22 9:29 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-12 16:13 [Intel-gfx] [PATCH] drm/i915/gt: Reset twice Andi Shyti
2022-12-12 16:55 ` Rodrigo Vivi
2022-12-12 23:08 ` Andi Shyti
2022-12-13 13:18 ` Vivi, Rodrigo
2022-12-14 22:37 ` Andi Shyti
2022-12-15 20:07 ` Rodrigo Vivi
2022-12-22 9:28 ` Gwan-gyeong Mun [this message]
2022-12-22 13:47 ` Andi Shyti
2022-12-23 6:24 ` Gwan-gyeong Mun
2022-12-12 18:34 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2022-12-12 18:46 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-12-13 10:11 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4a3f841d-e0ab-dfd9-a6c0-08e2e04c7b6f@intel.com \
--to=gwan-gyeong.mun@intel.com \
--cc=andi.shyti@linux.intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=rodrigo.vivi@intel.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox