public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed
* IWL errors when reading PCI config through /sys
@ 2024-11-03 12:52 Jan Šídlo
  2024-11-04 21:22 ` Jan Šídlo
  2024-11-04 23:33 ` Bjorn Helgaas
  0 siblings, 2 replies; 5+ messages in thread
From: Jan Šídlo @ 2024-11-03 12:52 UTC (permalink / raw)
  To: linux-pci

Hello,

I'm not sure if this is the right place - if not, I'm sorry! It is the first time I'm trying to join a linux
mailing list so I may have missed something or I may have done something incorrectly. I'm not even sure if
this is the right way to send a message, but I have to start somewhere :)

I'm trying to hunt down few issues with my new-ish HP ZBook not wanting to go to deeper C-stsates, which is
kind of painful for a laptop (battery drain is ~5-10%/hour). For this I created a little python script that
gathers all the info about all the components from the system and periodically reports the status (every 3s or
so) including PCI and USB devices. To gather some information (specifically about ASPM) I'm reading /config
file for each PCI device in /sys device tree and parsing it. I'm not reading only /config but it is a prime
suspect, because I excluded WLAN card from this reading routine and the crash took much longer to occur -
hours instead of minutes.

When I run this script, the IWL subsystem crashes after some time (minutes to hours). There is clearly
something going on the PCI bus that I don't really understand. Since the error I get from IWL is changing, I
suspect there is some kind of race condition that is triggered by my script. I opened a bug [1] and after some
back and forth with Emmanuel Grumbach, he said that this kind of error is caused by IWL not being able to talk
to the WLAN device (at all) and to try to get your opinion on the matter :)

I have tried two different kernel versions (6.11.5 and 6.10.10), two different WLAN cards (BE200NGW and
AX211NGW) and multiple versions of firmware for the cards. The error is still present, so I would say I'd need
to dig deeper, but I'm not really familiar with PCI subsystem and how to debug it efficiently given the amount
of data going through.

What can I do to debug this issue further?

Thanks
Jan

1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: IWL errors when reading PCI config through /sys
  2024-11-03 12:52 IWL errors when reading PCI config through /sys Jan Šídlo
@ 2024-11-04 21:22 ` Jan Šídlo
  2024-11-04 23:33 ` Bjorn Helgaas
  1 sibling, 0 replies; 5+ messages in thread
From: Jan Šídlo @ 2024-11-04 21:22 UTC (permalink / raw)
  To: linux-pci

Hi,

I have new findings - I was mistaken when I put the blame on PCI config of the device. After some more digging
(and many reboots), I can now trigger the bug at will - all I have to do is write "powersave" to pcie_aspm
policy:
  echo powersave | /sys/module/pcie_aspm/parameters/policy

Few seconds after that I get the ol' crash:

04.11 21:54:03  root[11383]: Calling 'echo powersave | /sys/module/pcie_aspm/parameters/policy'
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Error sending SYSTEM_STATISTICS_CMD: time out after 2000ms.
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Current CMD queue read_ptr 309 write_ptr 310
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Start IWL Error Log Dump:
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Transport status: 0x0000004A, valid: -1
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Loaded firmware version: 92.67ce4588.0 gl-c0-fm-c0-92.ucode
....

I tested it multiple times and it is definitely consistent.

I always thought that pcie_aspm/parameters/policy should be generally safe (as long as I use
default/performance/powersave and not pcie_aspm=force on the cmdline). Am I mistaken and should this setting
be avoided as unsafe?

If not, can I somehow help to get to the bottom of this?

Thanks
Jan


On Sun, 2024-11-03 at 13:52 +0100, Jan Šídlo wrote:
> Hello,
> 
> I'm not sure if this is the right place - if not, I'm sorry! It is the first time I'm trying to join a linux
> mailing list so I may have missed something or I may have done something incorrectly. I'm not even sure if
> this is the right way to send a message, but I have to start somewhere :)
> 
> I'm trying to hunt down few issues with my new-ish HP ZBook not wanting to go to deeper C-stsates, which is
> kind of painful for a laptop (battery drain is ~5-10%/hour). For this I created a little python script that
> gathers all the info about all the components from the system and periodically reports the status (every 3s
> or
> so) including PCI and USB devices. To gather some information (specifically about ASPM) I'm reading /config
> file for each PCI device in /sys device tree and parsing it. I'm not reading only /config but it is a prime
> suspect, because I excluded WLAN card from this reading routine and the crash took much longer to occur -
> hours instead of minutes.
> 
> When I run this script, the IWL subsystem crashes after some time (minutes to hours). There is clearly
> something going on the PCI bus that I don't really understand. Since the error I get from IWL is changing, I
> suspect there is some kind of race condition that is triggered by my script. I opened a bug [1] and after
> some
> back and forth with Emmanuel Grumbach, he said that this kind of error is caused by IWL not being able to
> talk
> to the WLAN device (at all) and to try to get your opinion on the matter :)
> 
> I have tried two different kernel versions (6.11.5 and 6.10.10), two different WLAN cards (BE200NGW and
> AX211NGW) and multiple versions of firmware for the cards. The error is still present, so I would say I'd
> need
> to dig deeper, but I'm not really familiar with PCI subsystem and how to debug it efficiently given the
> amount
> of data going through.
> 
> What can I do to debug this issue further?
> 
> Thanks
> Jan
> 
> 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457
> 
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: IWL errors when reading PCI config through /sys
  2024-11-03 12:52 IWL errors when reading PCI config through /sys Jan Šídlo
  2024-11-04 21:22 ` Jan Šídlo
@ 2024-11-04 23:33 ` Bjorn Helgaas
  2024-11-05  0:24   ` Jan Šídlo
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2024-11-04 23:33 UTC (permalink / raw)
  To: Jan Šídlo; +Cc: linux-pci, Emmanuel Grumbach

[+cc Emmanuel]

On Sun, Nov 03, 2024 at 01:52:24PM +0100, Jan Šídlo wrote:
> ...

Thanks for the report!  Also, thanks for mentioning the bugzilla
report here on the mailing list, since most subsystems don't actively
monitor bugzilla.

> I'm trying to hunt down few issues with my new-ish HP ZBook not
> wanting to go to deeper C-stsates, which is kind of painful for a
> laptop (battery drain is ~5-10%/hour). For this I created a little
> python script that gathers all the info about all the components
> from the system and periodically reports the status (every 3s or so)
> including PCI and USB devices. To gather some information
> (specifically about ASPM) I'm reading /config file for each PCI
> device in /sys device tree and parsing it. I'm not reading only
> /config but it is a prime suspect, because I excluded WLAN card from
> this reading routine and the crash took much longer to occur - hours
> instead of minutes.
> 
> When I run this script, the IWL subsystem crashes after some time
> (minutes to hours). There is clearly something going on the PCI bus
> that I don't really understand. Since the error I get from IWL is
> changing, I suspect there is some kind of race condition that is
> triggered by my script. I opened a bug [1] and after some back and
> forth with Emmanuel Grumbach, he said that this kind of error is
> caused by IWL not being able to talk to the WLAN device (at all) and
> to try to get your opinion on the matter :)

It *should* be safe to read "config" from sysfs at any time, and also
to write to the ASPM "policy" module parameter file at any time, but
there could be bugs there.

When you say "crash", I guess you mean all the iwlwifi error logging
and the WARN_ON() stacktraces, right?   I don't see an actual oops or
panic in the logs yet.

I assume none of these happen unless you are running your script or
writing the "policy" parameter?  Does the problem happen if you *only*
run your script to scrape the info from "config"?  What about if you
*only* update the "policy" parameter?

Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0:
0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the
device are failing so we get ~0 data.  I'm guessing those come from a
BAR, so the BAR could be disabled or the device might not be
responding e.g., if it is in a low-power state (D1, D2, D3hot, D3cold)
or being reset.

I don't know whether iwlwifi checks for any PCIe failures like this.
I see iwl_trans_is_hw_error_value(), but that must be for some
iwlwifi-specific error thing, not for PCIe errors, because it checks
for things like 0xa5a5a5a0.  For PCIe errors, we would see ~0
(0xffffffff).

My guess is that all the other WARN()s and stacktraces are just a
consequence of trying to do things to a device that isn't responding.

> I have tried two different kernel versions (6.11.5 and 6.10.10), two
> different WLAN cards (BE200NGW and AX211NGW) and multiple versions
> of firmware for the cards. The error is still present, so I would
> say I'd need to dig deeper, but I'm not really familiar with PCI
> subsystem and how to debug it efficiently given the amount of data
> going through.
> 
> What can I do to debug this issue further?
>
> 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457

Any clue if this is a regression?  Seems like a common device and we
should have lots of reports.  That would suggest something related to
scraping the "config" file or updating ASPM "policy" at runtime.  So
I'd say the first step is to confirm that one or both of those is
implicated.

Bjorn

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: IWL errors when reading PCI config through /sys
  2024-11-04 23:33 ` Bjorn Helgaas
@ 2024-11-05  0:24   ` Jan Šídlo
  2024-11-05 14:34     ` Bjorn Helgaas
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Šídlo @ 2024-11-05  0:24 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Emmanuel Grumbach

On Mon, 2024-11-04 at 17:33 -0600, Bjorn Helgaas wrote:
> It *should* be safe to read "config" from sysfs at any time, and also
> to write to the ASPM "policy" module parameter file at any time, but
> there could be bugs there.
> 
> When you say "crash", I guess you mean all the iwlwifi error logging
> and the WARN_ON() stacktraces, right?   I don't see an actual oops or
> panic in the logs yet.

There is no crash in form of an oops from the kernel fortunately :) But the WLAN card stops talking & IWL
driver is not able to recover. Only shutdown fixes the issue. I did not try just reboot to be honest as I
thought that full powercycle is necessary to properly reset the device - but I can try tomorrow if necessary.

> I assume none of these happen unless you are running your script or
> writing the "policy" parameter?  Does the problem happen if you *only*
> run your script to scrape the info from "config"?  What about if you
> *only* update the "policy" parameter?

The error does not happen if I read the config - I tested that properly. Without touching the ASPM policy the
script is able to run without any problems. And also I can trigger the bug immediately when I write
"powersave" to the ASPM policy without the script.

> Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0:
> 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the
> device are failing so we get ~0 data.  I'm guessing those come from a
> BAR, so the BAR could be disabled or the device might not be
> responding e.g., if it is in a low-power state (D1, D2, D3hot, D3cold)
> or being reset.

Device is reported being in D0 through the sysfs, but I'm not sure if that is really correct, because if I do
echo 1 > remove and rescan, the kernel complains about not being able to talk to the device. I can get the
exact error tomorrow if you'd like.

> I don't know whether iwlwifi checks for any PCIe failures like this.
> I see iwl_trans_is_hw_error_value(), but that must be for some
> iwlwifi-specific error thing, not for PCIe errors, because it checks
> for things like 0xa5a5a5a0.  For PCIe errors, we would see ~0
> (0xffffffff).
> 
> My guess is that all the other WARN()s and stacktraces are just a
> consequence of trying to do things to a device that isn't responding.

You are probably right, Emmanuel did mention similar thing - that these errors are typically thrown around
when the iwl driver cannot talk to the card (at all).

> > I have tried two different kernel versions (6.11.5 and 6.10.10), two
> > different WLAN cards (BE200NGW and AX211NGW) and multiple versions
> > of firmware for the cards. The error is still present, so I would
> > say I'd need to dig deeper, but I'm not really familiar with PCI
> > subsystem and how to debug it efficiently given the amount of data
> > going through.
> > 
> > What can I do to debug this issue further?
> > 
> > 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457
> 
> Any clue if this is a regression?  Seems like a common device and we
> should have lots of reports.  That would suggest something related to
> scraping the "config" file or updating ASPM "policy" at runtime.  So
> I'd say the first step is to confirm that one or both of those is
> implicated.

Updating policy definitely triggers this (at least on my ZBook) reliably. Scraping config is fine, I wrongly
thought it was the problem (mea culpa) :)

TLP (common power management tool for laptops) can also write this policy, but I don't know if it is enabled
by default (I believe it is not) so that might be the reason there are not lots of these reports.

I also believe that I did not encounter this error when I was using TLP on this laptop earlier - and I *had*
this policy enabled. However I stopped using TLP some time ago (unfortunately I can't remember when exactly)
and switched to a set of custom scripts when I was trying to hunt down the shallow cstates/bad S0ix residency.

I saw "queue Y is stuck" quite often, but these were not annoying enough to make me dig deeper as the iwl
would recover from this with just few second long outage.

I'm not sure yet if this is a regression, but I can test out few older kernel versions. Gentoo still has 5.10
and up so it should not be that hard to try out if you think it will help. :)

Thanks for your time!

Jan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: IWL errors when reading PCI config through /sys
  2024-11-05  0:24   ` Jan Šídlo
@ 2024-11-05 14:34     ` Bjorn Helgaas
  0 siblings, 0 replies; 5+ messages in thread
From: Bjorn Helgaas @ 2024-11-05 14:34 UTC (permalink / raw)
  To: Jan Šídlo; +Cc: linux-pci, Emmanuel Grumbach

On Tue, Nov 05, 2024 at 01:24:59AM +0100, Jan Šídlo wrote:
> On Mon, 2024-11-04 at 17:33 -0600, Bjorn Helgaas wrote:
> > It *should* be safe to read "config" from sysfs at any time, and
> > also to write to the ASPM "policy" module parameter file at any
> > time, but there could be bugs there.
> > 
> > When you say "crash", I guess you mean all the iwlwifi error
> > logging and the WARN_ON() stacktraces, right?   I don't see an
> > actual oops or panic in the logs yet.
> 
> There is no crash in form of an oops from the kernel fortunately :)
> But the WLAN card stops talking & IWL driver is not able to recover.
> Only shutdown fixes the issue. I did not try just reboot to be
> honest as I thought that full powercycle is necessary to properly
> reset the device - but I can try tomorrow if necessary.
> 
> > I assume none of these happen unless you are running your script
> > or writing the "policy" parameter?  Does the problem happen if you
> > *only* run your script to scrape the info from "config"?  What
> > about if you *only* update the "policy" parameter?
> 
> The error does not happen if I read the config - I tested that
> properly. Without touching the ASPM policy the script is able to run
> without any problems. And also I can trigger the bug immediately
> when I write "powersave" to the ASPM policy without the script.

Perfect, thanks for narrowing that down!

> > Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi
> > 0000:04:00.0: 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like
> > reads from the device are failing so we get ~0 data.  I'm guessing
> > those come from a BAR, so the BAR could be disabled or the device
> > might not be responding e.g., if it is in a low-power state (D1,
> > D2, D3hot, D3cold) or being reset.
> 
> Device is reported being in D0 through the sysfs, but I'm not sure
> if that is really correct, because if I do echo 1 > remove and
> rescan, the kernel complains about not being able to talk to the
> device. I can get the exact error tomorrow if you'd like.

It's unavoidably racy to read the current state from config space.
But since you've identified the write to "policy" in
pcie_aspm_set_policy() as the critical item, I think that's the place
to look.

We had some recent issues related to configuring ASPM while the device
was in a low-power state, e.g.,
https://lore.kernel.org/linux-pci/20240130163519.GA521777@bhelgaas/

While pcie_config_aspm_link() does check dev->current_state, I don't
see anything that would prevent the power management framework from
changing the power state while we're configuring devices to match the
new ASPM state.

Bjorn

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-11-05 14:34 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-03 12:52 IWL errors when reading PCI config through /sys Jan Šídlo
2024-11-04 21:22 ` Jan Šídlo
2024-11-04 23:33 ` Bjorn Helgaas
2024-11-05  0:24   ` Jan Šídlo
2024-11-05 14:34     ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox