* IWL errors when reading PCI config through /sys @ 2024-11-03 12:52 Jan Šídlo 2024-11-04 21:22 ` Jan Šídlo 2024-11-04 23:33 ` Bjorn Helgaas 0 siblings, 2 replies; 5+ messages in thread From: Jan Šídlo @ 2024-11-03 12:52 UTC (permalink / raw) To: linux-pci Hello, I'm not sure if this is the right place - if not, I'm sorry! It is the first time I'm trying to join a linux mailing list so I may have missed something or I may have done something incorrectly. I'm not even sure if this is the right way to send a message, but I have to start somewhere :) I'm trying to hunt down few issues with my new-ish HP ZBook not wanting to go to deeper C-stsates, which is kind of painful for a laptop (battery drain is ~5-10%/hour). For this I created a little python script that gathers all the info about all the components from the system and periodically reports the status (every 3s or so) including PCI and USB devices. To gather some information (specifically about ASPM) I'm reading /config file for each PCI device in /sys device tree and parsing it. I'm not reading only /config but it is a prime suspect, because I excluded WLAN card from this reading routine and the crash took much longer to occur - hours instead of minutes. When I run this script, the IWL subsystem crashes after some time (minutes to hours). There is clearly something going on the PCI bus that I don't really understand. Since the error I get from IWL is changing, I suspect there is some kind of race condition that is triggered by my script. I opened a bug [1] and after some back and forth with Emmanuel Grumbach, he said that this kind of error is caused by IWL not being able to talk to the WLAN device (at all) and to try to get your opinion on the matter :) I have tried two different kernel versions (6.11.5 and 6.10.10), two different WLAN cards (BE200NGW and AX211NGW) and multiple versions of firmware for the cards. The error is still present, so I would say I'd need to dig deeper, but I'm not really familiar with PCI subsystem and how to debug it efficiently given the amount of data going through. What can I do to debug this issue further? Thanks Jan 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: IWL errors when reading PCI config through /sys 2024-11-03 12:52 IWL errors when reading PCI config through /sys Jan Šídlo @ 2024-11-04 21:22 ` Jan Šídlo 2024-11-04 23:33 ` Bjorn Helgaas 1 sibling, 0 replies; 5+ messages in thread From: Jan Šídlo @ 2024-11-04 21:22 UTC (permalink / raw) To: linux-pci Hi, I have new findings - I was mistaken when I put the blame on PCI config of the device. After some more digging (and many reboots), I can now trigger the bug at will - all I have to do is write "powersave" to pcie_aspm policy: echo powersave | /sys/module/pcie_aspm/parameters/policy Few seconds after that I get the ol' crash: 04.11 21:54:03 root[11383]: Calling 'echo powersave | /sys/module/pcie_aspm/parameters/policy' 04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Error sending SYSTEM_STATISTICS_CMD: time out after 2000ms. 04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Current CMD queue read_ptr 309 write_ptr 310 04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Start IWL Error Log Dump: 04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Transport status: 0x0000004A, valid: -1 04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Loaded firmware version: 92.67ce4588.0 gl-c0-fm-c0-92.ucode .... I tested it multiple times and it is definitely consistent. I always thought that pcie_aspm/parameters/policy should be generally safe (as long as I use default/performance/powersave and not pcie_aspm=force on the cmdline). Am I mistaken and should this setting be avoided as unsafe? If not, can I somehow help to get to the bottom of this? Thanks Jan On Sun, 2024-11-03 at 13:52 +0100, Jan Šídlo wrote: > Hello, > > I'm not sure if this is the right place - if not, I'm sorry! It is the first time I'm trying to join a linux > mailing list so I may have missed something or I may have done something incorrectly. I'm not even sure if > this is the right way to send a message, but I have to start somewhere :) > > I'm trying to hunt down few issues with my new-ish HP ZBook not wanting to go to deeper C-stsates, which is > kind of painful for a laptop (battery drain is ~5-10%/hour). For this I created a little python script that > gathers all the info about all the components from the system and periodically reports the status (every 3s > or > so) including PCI and USB devices. To gather some information (specifically about ASPM) I'm reading /config > file for each PCI device in /sys device tree and parsing it. I'm not reading only /config but it is a prime > suspect, because I excluded WLAN card from this reading routine and the crash took much longer to occur - > hours instead of minutes. > > When I run this script, the IWL subsystem crashes after some time (minutes to hours). There is clearly > something going on the PCI bus that I don't really understand. Since the error I get from IWL is changing, I > suspect there is some kind of race condition that is triggered by my script. I opened a bug [1] and after > some > back and forth with Emmanuel Grumbach, he said that this kind of error is caused by IWL not being able to > talk > to the WLAN device (at all) and to try to get your opinion on the matter :) > > I have tried two different kernel versions (6.11.5 and 6.10.10), two different WLAN cards (BE200NGW and > AX211NGW) and multiple versions of firmware for the cards. The error is still present, so I would say I'd > need > to dig deeper, but I'm not really familiar with PCI subsystem and how to debug it efficiently given the > amount > of data going through. > > What can I do to debug this issue further? > > Thanks > Jan > > 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457 > > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: IWL errors when reading PCI config through /sys 2024-11-03 12:52 IWL errors when reading PCI config through /sys Jan Šídlo 2024-11-04 21:22 ` Jan Šídlo @ 2024-11-04 23:33 ` Bjorn Helgaas 2024-11-05 0:24 ` Jan Šídlo 1 sibling, 1 reply; 5+ messages in thread From: Bjorn Helgaas @ 2024-11-04 23:33 UTC (permalink / raw) To: Jan Šídlo; +Cc: linux-pci, Emmanuel Grumbach [+cc Emmanuel] On Sun, Nov 03, 2024 at 01:52:24PM +0100, Jan Šídlo wrote: > ... Thanks for the report! Also, thanks for mentioning the bugzilla report here on the mailing list, since most subsystems don't actively monitor bugzilla. > I'm trying to hunt down few issues with my new-ish HP ZBook not > wanting to go to deeper C-stsates, which is kind of painful for a > laptop (battery drain is ~5-10%/hour). For this I created a little > python script that gathers all the info about all the components > from the system and periodically reports the status (every 3s or so) > including PCI and USB devices. To gather some information > (specifically about ASPM) I'm reading /config file for each PCI > device in /sys device tree and parsing it. I'm not reading only > /config but it is a prime suspect, because I excluded WLAN card from > this reading routine and the crash took much longer to occur - hours > instead of minutes. > > When I run this script, the IWL subsystem crashes after some time > (minutes to hours). There is clearly something going on the PCI bus > that I don't really understand. Since the error I get from IWL is > changing, I suspect there is some kind of race condition that is > triggered by my script. I opened a bug [1] and after some back and > forth with Emmanuel Grumbach, he said that this kind of error is > caused by IWL not being able to talk to the WLAN device (at all) and > to try to get your opinion on the matter :) It *should* be safe to read "config" from sysfs at any time, and also to write to the ASPM "policy" module parameter file at any time, but there could be bugs there. When you say "crash", I guess you mean all the iwlwifi error logging and the WARN_ON() stacktraces, right? I don't see an actual oops or panic in the logs yet. I assume none of these happen unless you are running your script or writing the "policy" parameter? Does the problem happen if you *only* run your script to scrape the info from "config"? What about if you *only* update the "policy" parameter? Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0: 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the device are failing so we get ~0 data. I'm guessing those come from a BAR, so the BAR could be disabled or the device might not be responding e.g., if it is in a low-power state (D1, D2, D3hot, D3cold) or being reset. I don't know whether iwlwifi checks for any PCIe failures like this. I see iwl_trans_is_hw_error_value(), but that must be for some iwlwifi-specific error thing, not for PCIe errors, because it checks for things like 0xa5a5a5a0. For PCIe errors, we would see ~0 (0xffffffff). My guess is that all the other WARN()s and stacktraces are just a consequence of trying to do things to a device that isn't responding. > I have tried two different kernel versions (6.11.5 and 6.10.10), two > different WLAN cards (BE200NGW and AX211NGW) and multiple versions > of firmware for the cards. The error is still present, so I would > say I'd need to dig deeper, but I'm not really familiar with PCI > subsystem and how to debug it efficiently given the amount of data > going through. > > What can I do to debug this issue further? > > 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457 Any clue if this is a regression? Seems like a common device and we should have lots of reports. That would suggest something related to scraping the "config" file or updating ASPM "policy" at runtime. So I'd say the first step is to confirm that one or both of those is implicated. Bjorn ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: IWL errors when reading PCI config through /sys 2024-11-04 23:33 ` Bjorn Helgaas @ 2024-11-05 0:24 ` Jan Šídlo 2024-11-05 14:34 ` Bjorn Helgaas 0 siblings, 1 reply; 5+ messages in thread From: Jan Šídlo @ 2024-11-05 0:24 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: linux-pci, Emmanuel Grumbach On Mon, 2024-11-04 at 17:33 -0600, Bjorn Helgaas wrote: > It *should* be safe to read "config" from sysfs at any time, and also > to write to the ASPM "policy" module parameter file at any time, but > there could be bugs there. > > When you say "crash", I guess you mean all the iwlwifi error logging > and the WARN_ON() stacktraces, right? I don't see an actual oops or > panic in the logs yet. There is no crash in form of an oops from the kernel fortunately :) But the WLAN card stops talking & IWL driver is not able to recover. Only shutdown fixes the issue. I did not try just reboot to be honest as I thought that full powercycle is necessary to properly reset the device - but I can try tomorrow if necessary. > I assume none of these happen unless you are running your script or > writing the "policy" parameter? Does the problem happen if you *only* > run your script to scrape the info from "config"? What about if you > *only* update the "policy" parameter? The error does not happen if I read the config - I tested that properly. Without touching the ASPM policy the script is able to run without any problems. And also I can trigger the bug immediately when I write "powersave" to the ASPM policy without the script. > Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0: > 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the > device are failing so we get ~0 data. I'm guessing those come from a > BAR, so the BAR could be disabled or the device might not be > responding e.g., if it is in a low-power state (D1, D2, D3hot, D3cold) > or being reset. Device is reported being in D0 through the sysfs, but I'm not sure if that is really correct, because if I do echo 1 > remove and rescan, the kernel complains about not being able to talk to the device. I can get the exact error tomorrow if you'd like. > I don't know whether iwlwifi checks for any PCIe failures like this. > I see iwl_trans_is_hw_error_value(), but that must be for some > iwlwifi-specific error thing, not for PCIe errors, because it checks > for things like 0xa5a5a5a0. For PCIe errors, we would see ~0 > (0xffffffff). > > My guess is that all the other WARN()s and stacktraces are just a > consequence of trying to do things to a device that isn't responding. You are probably right, Emmanuel did mention similar thing - that these errors are typically thrown around when the iwl driver cannot talk to the card (at all). > > I have tried two different kernel versions (6.11.5 and 6.10.10), two > > different WLAN cards (BE200NGW and AX211NGW) and multiple versions > > of firmware for the cards. The error is still present, so I would > > say I'd need to dig deeper, but I'm not really familiar with PCI > > subsystem and how to debug it efficiently given the amount of data > > going through. > > > > What can I do to debug this issue further? > > > > 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457 > > Any clue if this is a regression? Seems like a common device and we > should have lots of reports. That would suggest something related to > scraping the "config" file or updating ASPM "policy" at runtime. So > I'd say the first step is to confirm that one or both of those is > implicated. Updating policy definitely triggers this (at least on my ZBook) reliably. Scraping config is fine, I wrongly thought it was the problem (mea culpa) :) TLP (common power management tool for laptops) can also write this policy, but I don't know if it is enabled by default (I believe it is not) so that might be the reason there are not lots of these reports. I also believe that I did not encounter this error when I was using TLP on this laptop earlier - and I *had* this policy enabled. However I stopped using TLP some time ago (unfortunately I can't remember when exactly) and switched to a set of custom scripts when I was trying to hunt down the shallow cstates/bad S0ix residency. I saw "queue Y is stuck" quite often, but these were not annoying enough to make me dig deeper as the iwl would recover from this with just few second long outage. I'm not sure yet if this is a regression, but I can test out few older kernel versions. Gentoo still has 5.10 and up so it should not be that hard to try out if you think it will help. :) Thanks for your time! Jan ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: IWL errors when reading PCI config through /sys 2024-11-05 0:24 ` Jan Šídlo @ 2024-11-05 14:34 ` Bjorn Helgaas 0 siblings, 0 replies; 5+ messages in thread From: Bjorn Helgaas @ 2024-11-05 14:34 UTC (permalink / raw) To: Jan Šídlo; +Cc: linux-pci, Emmanuel Grumbach On Tue, Nov 05, 2024 at 01:24:59AM +0100, Jan Šídlo wrote: > On Mon, 2024-11-04 at 17:33 -0600, Bjorn Helgaas wrote: > > It *should* be safe to read "config" from sysfs at any time, and > > also to write to the ASPM "policy" module parameter file at any > > time, but there could be bugs there. > > > > When you say "crash", I guess you mean all the iwlwifi error > > logging and the WARN_ON() stacktraces, right? I don't see an > > actual oops or panic in the logs yet. > > There is no crash in form of an oops from the kernel fortunately :) > But the WLAN card stops talking & IWL driver is not able to recover. > Only shutdown fixes the issue. I did not try just reboot to be > honest as I thought that full powercycle is necessary to properly > reset the device - but I can try tomorrow if necessary. > > > I assume none of these happen unless you are running your script > > or writing the "policy" parameter? Does the problem happen if you > > *only* run your script to scrape the info from "config"? What > > about if you *only* update the "policy" parameter? > > The error does not happen if I read the config - I tested that > properly. Without touching the ASPM policy the script is able to run > without any problems. And also I can trigger the bug immediately > when I write "powersave" to the ASPM policy without the script. Perfect, thanks for narrowing that down! > > Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi > > 0000:04:00.0: 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like > > reads from the device are failing so we get ~0 data. I'm guessing > > those come from a BAR, so the BAR could be disabled or the device > > might not be responding e.g., if it is in a low-power state (D1, > > D2, D3hot, D3cold) or being reset. > > Device is reported being in D0 through the sysfs, but I'm not sure > if that is really correct, because if I do echo 1 > remove and > rescan, the kernel complains about not being able to talk to the > device. I can get the exact error tomorrow if you'd like. It's unavoidably racy to read the current state from config space. But since you've identified the write to "policy" in pcie_aspm_set_policy() as the critical item, I think that's the place to look. We had some recent issues related to configuring ASPM while the device was in a low-power state, e.g., https://lore.kernel.org/linux-pci/20240130163519.GA521777@bhelgaas/ While pcie_config_aspm_link() does check dev->current_state, I don't see anything that would prevent the power management framework from changing the power state while we're configuring devices to match the new ASPM state. Bjorn ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-11-05 14:34 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-11-03 12:52 IWL errors when reading PCI config through /sys Jan Šídlo 2024-11-04 21:22 ` Jan Šídlo 2024-11-04 23:33 ` Bjorn Helgaas 2024-11-05 0:24 ` Jan Šídlo 2024-11-05 14:34 ` Bjorn Helgaas
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox