From: Nirmoy Das <nirmoy.das@linux.intel.com>
To: Andi Shyti <andi.shyti@linux.intel.com>
Cc: intel-gfx <intel-gfx@lists.freedesktop.org>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>,
	Nirmoy Das <nirmoy.das@intel.com>,
	Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>,
	Andi Shyti <andi.shyti@kernel.org>
Subject: Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors
Date: Wed, 22 May 2024 11:07:45 +0200
Message-ID: <af8150dc-c36f-478f-a0f5-b5dfed272b30@linux.intel.com>
In-Reply-To: <Zkx9-RHwTEnblEXo@ashyti-mobl2.lan>

Hi Andi,

On 5/21/2024 12:56 PM, Andi Shyti wrote:
> Hi Nirmoy,
>
> On Fri, May 17, 2024 at 10:13:37PM +0200, Nirmoy Das wrote:
>> Hi Andi,
>>
>> On 5/17/2024 9:34 PM, Andi Shyti wrote:
>>
>>      Hi Nirmoy,
>>
>>      On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:
>>
>>          On 5/17/2024 1:25 PM, Andi Shyti wrote:
>>
>>              If we time out while waiting for an FLR reset, there is nothing we
>>              can do and i915 doesn't have any control over it. In any case the
>>              system is still perfectly usable
>>
>>          If an FLR reset fails then we will have a dead GPU; I don't think the GPU is
>>          usable without a cold reboot.
>>
>>      fact is that the GPU keeps going and even though the timeout has
>>      expired, the system moves to the next phase.
>>
>> The current test might look like it has passed, but if you look at the
>> subsequent tests you can see a dead GPU:
>>
>> <7>[  369.168121] pci 0000:00:02.0: [drm:intel_uncore_fini_mmio [i915]] Triggering Driver-FLR
>> <3>[  372.170189] pci 0000:00:02.0: [drm] *ERROR* Driver-FLR-teardown wait completion failed! -110
>> <7>[  372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS
>> <7>[  372.438356] [IGT] i915_selftest: starting dynamic subtest migrate
>> <5>[  373.110580] Setting dangerous option live_selftests - tainting kernel
>> <3>[  373.183499] i915 0000:00:02.0: Unable to change power state from D0 to D0, device inaccessible
>> <3>[  373.246921] i915 0000:00:02.0: [drm] *ERROR* Unrecognized display IP version 1023.255; disabling display.
>> <7>[  373.247130] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future steppings
>> <7>[  373.247716] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future steppings
>> <7>[  373.248263] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future display steppings
>> <7>[  373.251843] i915 0000:00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K
>> <7>[  373.252505] i915 0000:00:02.0: [drm:intel_uc_init_early [i915]] GT0: enable_guc=3 (guc:yes submission:yes huc:no slpc:yes)
>> <7>[  373.253140] i915 0000:00:02.0: [drm:intel_gt_probe_all [i915]] GT0: Setting up Primary GT
>> <7>[  373.253556] i915 0000:00:02.0: [drm:intel_gt_probe_all [i915]] GT1: Setting up Standalone Media GT
>> <7>[  373.253941] i915 0000:00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K
>> <7>[  373.254365] i915 0000:00:02.0: [drm:intel_uc_init_early [i915]] GT1: enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes)
>> <3>[  375.256235] i915 0000:00:02.0: [drm] *ERROR* Device is non-operational; MMIO access returns 0xFFFFFFFF!
>> <3>[  375.259089] i915 0000:00:02.0: Device initialization failed (-5)
>> <3>[  375.260521] i915 0000:00:02.0: probe with driver i915 failed with error -5
>> <7>[  375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL
>>
>> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt
> Are we sure this is dependent on the FLR reset?

Yes. While the device is in FLR, reads from it return either all 0s or all Fs.
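
As a rough illustration of that readback pattern, here is a minimal userspace sketch. It is not i915 code: the helper name is hypothetical and exists only for this example. It assumes a 32-bit register whose value reads back as all ones (the 0xFFFFFFFF reported in the log above) or all zeros when the device is not responding. A register can legitimately hold 0, so a real check would have to pick a register that is known to be non-zero on a live device.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an i915 function: returns true when a 32-bit
 * register readback has every bit set or every bit clear, which is the
 * pattern described for a device that is stuck in (or never left) FLR.
 */
static bool readback_looks_dead(uint32_t val)
{
	return val == UINT32_MAX || val == 0;
}
```

With this sketch, 0xFFFFFFFF and 0 are both flagged, while any mixed value such as 0x00001234 is treated as a live readback.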


>   There are cases
> when the FLR reset doesn't make any difference and in any case
> this error is completely ignored by the driver.

This happens very late, with no recovery possible; the only hope is that a
module reload works.


>
> Perhaps we can change it to a warning?

I think it should stay an error. CI will still complain even about a warning.


>
>>          This is a serious issue and should be reported as an error. I think we need
>>          to create a HW ticket to understand why the FLR reset fails.
>>
>>      Maybe it takes longer and longer to reset. We've been sending
>>      several patches in recent years to fix the timings.
>>
>> HW spec says 3 sec, but we can try increasing it a bit to see if that helps.
> We could go, then, with just patch 1 and see if it improves.

Does it help? If it does, then we can go ahead with the increased timeout.
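
To make the timeout question concrete, here is a minimal userspace sketch of a poll-with-timeout loop. It is not the i915 implementation: `wait_for_flr_done()`, `flr_done_fn`, `struct fake_dev` and `fake_done()` are hypothetical names made up for this example, and time is simulated as one poll per millisecond instead of really sleeping. The only values taken from this thread are the 3 s and 9 s timeouts from patch 1 and the -110 in the log, which is -ETIMEDOUT on Linux.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: returns true once the reset has completed. */
typedef bool (*flr_done_fn)(void *ctx);

/*
 * Poll done() once per simulated millisecond until it reports completion
 * or timeout_ms polls have been made. Returns 0 on completion and
 * -ETIMEDOUT if the reset never reported done within the budget.
 */
static int wait_for_flr_done(flr_done_fn done, void *ctx,
			     unsigned int timeout_ms)
{
	for (unsigned int elapsed_ms = 0; elapsed_ms < timeout_ms; elapsed_ms++) {
		if (done(ctx))
			return 0;
		/* A real implementation would sleep for about 1 ms here. */
	}
	return -ETIMEDOUT;
}

/* Toy device model: the reset completes after more than needed_ms polls. */
struct fake_dev {
	unsigned int polls;
	unsigned int needed_ms;
};

static bool fake_done(void *ctx)
{
	struct fake_dev *dev = ctx;

	return ++dev->polls > dev->needed_ms;
}
```

In this toy model, a device that needs 5000 polls times out with a 3000 ms budget and completes with a 9000 ms budget. A device that never completes fails under either budget, which is the case where a longer timeout would not help.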


>   Also
> because the FLR reset might also depend on the firmware.

Possible. In that case, should we wait for a firmware fix?


Regards,

Nirmoy

>
> Thanks, Nirmoy,
> Andi

Thread overview:
2024-05-17 11:25 [PATCH 0/2] Don't be alarmed at FLR timeouts Andi Shyti
2024-05-17 11:25 ` [PATCH 1/2] drm/i915: Increase FLR timeout from 3s to 9s Andi Shyti
2024-05-17 11:25 ` [PATCH 2/2] drm/i915: Don't treat FLR resets as errors Andi Shyti
2024-05-17 14:00   ` Nirmoy Das
2024-05-17 19:34     ` Andi Shyti
2024-05-17 20:13       ` Nirmoy Das
2024-05-21 10:56         ` Andi Shyti
2024-05-22  9:07           ` Nirmoy Das [this message]
2024-05-17 12:45 ` ✓ Fi.CI.BAT: success for Don't be alarmed at FLR timeouts Patchwork
2024-05-17 17:53 ` ✗ Fi.CI.IGT: failure " Patchwork
