All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Matt Roper <matthew.d.roper@intel.com>
Cc: Intel-gfx@lists.freedesktop.org,
	Lucas De Marchi <lucas.demarchi@intel.com>,
	dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR
Date: Tue, 7 Jun 2022 10:20:51 +0100	[thread overview]
Message-ID: <2fd8b3c5-de0e-19fd-a500-1faf5d5bdb4b@linux.intel.com> (raw)
In-Reply-To: <Yp4bblojY+tq66VD@mdroper-desk1.amr.corp.intel.com>


On 06/06/2022 16:21, Matt Roper wrote:
> On Mon, Jun 06, 2022 at 12:55:20PM +0100, Tvrtko Ursulin wrote:
>>
>> On 27/05/2022 19:42, Matt Roper wrote:
>>> On Thu, May 26, 2022 at 11:18:17AM +0100, Tvrtko Ursulin wrote:
>>>> On 25/05/2022 19:05, Matt Roper wrote:
>>>>> On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 24/05/2022 18:51, Matt Roper wrote:
>>>>>>> On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>
>>>>>>>> Catch and log any garbage in the register, including no tiles marked, or
>>>>>>>> multiple tiles marked.
>>>>>>>>
>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>> ---
>>>>>>>> We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
>>>>>>>> during glmark and more badness. So I thought lets log all possible failure
>>>>>>>> modes from here and also use per device logging.
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>>>>>>>      drivers/gpu/drm/i915/i915_reg.h |  1 +
>>>>>>>>      2 files changed, 23 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> index 73cebc6aa650..79853d3fc1ed 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> @@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>>>>>>      	u32 gu_misc_iir;
>>>>>>>>      	if (!intel_irqs_enabled(i915))
>>>>>>>> -		return IRQ_NONE;
>>>>>>>> +		goto none;
>>>>>>>>      	master_tile_ctl = dg1_master_intr_disable(regs);
>>>>>>>> -	if (!master_tile_ctl) {
>>>>>>>> -		dg1_master_intr_enable(regs);
>>>>>>>> -		return IRQ_NONE;
>>>>>>>> +	if (!master_tile_ctl)
>>>>>>>> +		goto enable_none;
>>>>>>>> +
>>>>>>>> +	if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>>>>>> +		drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>>>>>> +			 master_tile_ctl);
>>>>>>>
>>>>>>> I know we have a bunch of them already, but shouldn't we be avoiding
>>>>>>> printk-based stuff like this inside interrupt handlers?  Should we be
>>>>>>> migrating all these error messages over to trace_printk or something
>>>>>>> similar that's safer to use?
>>>>>>
>>>>>> Not sure - I kind of think some really unexpected and worrying situations
>>>>>> should be loud and on by default. Risk is then spam if not ratelimited.
>>>>>> Maybe we should instead ratelimit most errors/warnings coming for irq
>>>>>> handlers?
>>>>>
>>>>> It's not the risk of spam that's the problem, but rather that
>>>>> printk-based stuff eventually calls into the console code to flush its
>>>>> buffers.  That's way more overhead than you want in an interrupt handler
>>>>> so it's bad on its own, but if you're using something slow like a serial
>>>>> console, it becomes even more of a problem.
>>>>
>>>> Is it a problem for messages which we never expect to see?
>>>
>>> Kind of.  While not as catastrophic, it's the same argument for why we
>>> don't use BUG() anymore...when the impossible does manage to happen
>>> there's unnecessary collateral damage on things outside of graphics.  If
>>> we're adding huge delays inside an interrupt handler (while other
>>> interrupts are disabled) that impacts the system-wide usability, not
>>> just our own driver.
>>>
>>> I'd also argue that these messages actually are semi-expected.  Random
>>> bits being set shouldn't happen, but in the world of dgpu's, we do
>>> occasionally see cases where the PCI link itself goes down for reasons
>>> outside our control and then all registers read back as 0xFFFFFFFF,
>>> which will probably trigger error messages here (as well as a bunch of
>>> other places).
>>
>> Could you expand a bit on what is semi-expected and when? I mean the
>> circumstances of PCI link going down. We certainly don't have any code to
>> survive that.
> 
> Yeah, I'm referring to the "Lost access to MMIO BAR" errors; in the past
> most of them have ultimately been tracked down to bugs in early
> firmware, so flashing an updated IFWI/BIOS onto the device usually
> solved the problems.  Generally those buggy firmwares are an internal
> problem that never make it into the wild, but I think we have also seen
> cases where they get triggered by physical/electrical problems on a
> specific part; that can potentially happen to anyone who's unlucky
> enough to get a defective/damaged unit.
> 
> Basically "hardware returns all F's" happens because the CPU initiates
> an MMIO transaction with the hardware, the hardware fails to produce any
> response (possibly due to failing hardware, possibly due to
> firmware/BIOS bugs), so 0xFFFFFFFF gets returned as an autocompletion to
> prevent the CPU core from hanging.
> 
> It looks like we still have a few open here:
> https://gitlab.freedesktop.org/search?search=%22Lost+access+to+MMIO+BAR%22&group_id=2642&project_id=4519&scope=issues&search_code=false&snippets=false&repository_ref=&nav_source=navbar
> 
> and there are some features on specific platforms we haven't turned on
> yet because they also trigger these failures (which is still under
> debug).
> 
> We don't/can't really do much to handle these problems in i915 today
> except printing the 'lost access' error so that we know to ignore
> whatever kinds of bogus errors we get after that point (usually lots of
> messages about forcewake failing to clear, engine/GuC reset failing to
> complete, etc.).  But aside from i915 being broken, the rest of the
> platform should generally continue to work, so you can still access the
> machine over the network, save logs to disk, etc.

Interesting, I missed the addition of 29b6f88d60dd ("drm/i915: Try to 
detect sudden loss of MMIO access"), thanks!

In case of my "Garbage in master_tile_ctl" or "Unexpected irq from 
tile.." messages, in case of lost PCI link they should happen only once. 
I don't think hardware will keep raising interrupts if driver cannot 
talk to it. But it does seem prudent to go with the rate-limiting 
flavour just in case.

Regards,

Tvrtko

WARNING: multiple messages have this Message-ID (diff)
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Matt Roper <matthew.d.roper@intel.com>
Cc: Intel-gfx@lists.freedesktop.org,
	Lucas De Marchi <lucas.demarchi@intel.com>,
	dri-devel@lists.freedesktop.org,
	Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Subject: Re: [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR
Date: Tue, 7 Jun 2022 10:20:51 +0100	[thread overview]
Message-ID: <2fd8b3c5-de0e-19fd-a500-1faf5d5bdb4b@linux.intel.com> (raw)
In-Reply-To: <Yp4bblojY+tq66VD@mdroper-desk1.amr.corp.intel.com>


On 06/06/2022 16:21, Matt Roper wrote:
> On Mon, Jun 06, 2022 at 12:55:20PM +0100, Tvrtko Ursulin wrote:
>>
>> On 27/05/2022 19:42, Matt Roper wrote:
>>> On Thu, May 26, 2022 at 11:18:17AM +0100, Tvrtko Ursulin wrote:
>>>> On 25/05/2022 19:05, Matt Roper wrote:
>>>>> On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 24/05/2022 18:51, Matt Roper wrote:
>>>>>>> On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>
>>>>>>>> Catch and log any garbage in the register, including no tiles marked, or
>>>>>>>> multiple tiles marked.
>>>>>>>>
>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>>>>>> ---
>>>>>>>> We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
>>>>>>>> during glmark and more badness. So I thought lets log all possible failure
>>>>>>>> modes from here and also use per device logging.
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>>>>>>>      drivers/gpu/drm/i915/i915_reg.h |  1 +
>>>>>>>>      2 files changed, 23 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> index 73cebc6aa650..79853d3fc1ed 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> @@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>>>>>>      	u32 gu_misc_iir;
>>>>>>>>      	if (!intel_irqs_enabled(i915))
>>>>>>>> -		return IRQ_NONE;
>>>>>>>> +		goto none;
>>>>>>>>      	master_tile_ctl = dg1_master_intr_disable(regs);
>>>>>>>> -	if (!master_tile_ctl) {
>>>>>>>> -		dg1_master_intr_enable(regs);
>>>>>>>> -		return IRQ_NONE;
>>>>>>>> +	if (!master_tile_ctl)
>>>>>>>> +		goto enable_none;
>>>>>>>> +
>>>>>>>> +	if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>>>>>> +		drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>>>>>> +			 master_tile_ctl);
>>>>>>>
>>>>>>> I know we have a bunch of them already, but shouldn't we be avoiding
>>>>>>> printk-based stuff like this inside interrupt handlers?  Should we be
>>>>>>> migrating all these error messages over to trace_printk or something
>>>>>>> similar that's safer to use?
>>>>>>
>>>>>> Not sure - I kind of think some really unexpected and worrying situations
>>>>>> should be loud and on by default. Risk is then spam if not ratelimited.
>>>>>> Maybe we should instead ratelimit most errors/warnings coming for irq
>>>>>> handlers?
>>>>>
>>>>> It's not the risk of spam that's the problem, but rather that
>>>>> printk-based stuff eventually calls into the console code to flush its
>>>>> buffers.  That's way more overhead than you want in an interrupt handler
>>>>> so it's bad on its own, but if you're using something slow like a serial
>>>>> console, it becomes even more of a problem.
>>>>
>>>> Is it a problem for messages which we never expect to see?
>>>
>>> Kind of.  While not as catastrophic, it's the same argument for why we
>>> don't use BUG() anymore...when the impossible does manage to happen
>>> there's unnecessary collateral damage on things outside of graphics.  If
>>> we're adding huge delays inside an interrupt handler (while other
>>> interrupts are disabled) that impacts the system-wide usability, not
>>> just our own driver.
>>>
>>> I'd also argue that these messages actually are semi-expected.  Random
>>> bits being set shouldn't happen, but in the world of dgpu's, we do
>>> occasionally see cases where the PCI link itself goes down for reasons
>>> outside our control and then all registers read back as 0xFFFFFFFF,
>>> which will probably trigger error messages here (as well as a bunch of
>>> other places).
>>
>> Could you expand a bit on what is semi-expected and when? I mean the
>> circumstances of PCI link going down. We certainly don't have any code to
>> survive that.
> 
> Yeah, I'm referring to the "Lost access to MMIO BAR" errors; in the past
> most of them have ultimately been tracked down to bugs in early
> firmware, so flashing an updated IFWI/BIOS onto the device usually
> solved the problems.  Generally those buggy firmwares are an internal
> problem that never make it into the wild, but I think we have also seen
> cases where they get triggered by physical/electrical problems on a
> specific part; that can potentially happen to anyone who's unlucky
> enough to get a defective/damaged unit.
> 
> Basically "hardware returns all F's" happens because the CPU initiates
> an MMIO transaction with the hardware, the hardware fails to produce any
> response (possibly due to failing hardware, possibly due to
> firmware/BIOS bugs), so 0xFFFFFFFF gets returned as an autocompletion to
> prevent the CPU core from hanging.
> 
> It looks like we still have a few open here:
> https://gitlab.freedesktop.org/search?search=%22Lost+access+to+MMIO+BAR%22&group_id=2642&project_id=4519&scope=issues&search_code=false&snippets=false&repository_ref=&nav_source=navbar
> 
> and there are some features on specific platforms we haven't turned on
> yet because they also trigger these failures (which is still under
> debug).
> 
> We don't/can't really do much to handle these problems in i915 today
> except printing the 'lost access' error so that we know to ignore
> whatever kinds of bogus errors we get after that point (usually lots of
> messages about forcewake failing to clear, engine/GuC reset failing to
> complete, etc.).  But aside from i915 being broken, the rest of the
> platform should generally continue to work, so you can still access the
> machine over the network, save logs to disk, etc.

Interesting, I missed the addition of 29b6f88d60dd ("drm/i915: Try to 
detect sudden loss of MMIO access"), thanks!

In case of my "Garbage in master_tile_ctl" or "Unexpected irq from 
tile.." messages, in case of lost PCI link they should happen only once. 
I don't think hardware will keep raising interrupts if driver cannot 
talk to it. But it does seem prudent to go with the rate-limiting 
flavour just in case.

Regards,

Tvrtko

  reply	other threads:[~2022-06-07  9:20 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-24  9:43 [Intel-gfx] [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR Tvrtko Ursulin
2022-05-24  9:43 ` Tvrtko Ursulin
2022-05-24 10:01 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for " Patchwork
2022-05-24 10:22 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-05-24 11:46 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2022-05-24 17:51 ` [Intel-gfx] [PATCH] " Matt Roper
2022-05-24 17:51   ` Matt Roper
2022-05-25 16:03   ` [Intel-gfx] " Tvrtko Ursulin
2022-05-25 16:03     ` Tvrtko Ursulin
2022-05-25 18:05     ` [Intel-gfx] " Matt Roper
2022-05-25 18:05       ` Matt Roper
2022-05-26 10:18       ` [Intel-gfx] " Tvrtko Ursulin
2022-05-26 10:18         ` Tvrtko Ursulin
2022-05-27 18:42         ` [Intel-gfx] " Matt Roper
2022-05-27 18:42           ` Matt Roper
2022-06-06 11:55           ` [Intel-gfx] " Tvrtko Ursulin
2022-06-06 11:55             ` Tvrtko Ursulin
2022-06-06 15:21             ` [Intel-gfx] " Matt Roper
2022-06-06 15:21               ` Matt Roper
2022-06-07  9:20               ` Tvrtko Ursulin [this message]
2022-06-07  9:20                 ` Tvrtko Ursulin
2022-05-25 18:14     ` [Intel-gfx] " Lucas De Marchi
2022-05-25 18:14       ` Lucas De Marchi
2022-05-26 10:29       ` [Intel-gfx] " Tvrtko Ursulin
2022-05-26 10:29         ` Tvrtko Ursulin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2fd8b3c5-de0e-19fd-a500-1faf5d5bdb4b@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=Intel-gfx@lists.freedesktop.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=matthew.d.roper@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.