From: Bjorn Helgaas <helgaas@kernel.org>
To: "Jan Šídlo" <me@xoores.cz>
Cc: linux-pci@vger.kernel.org,
Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Subject: Re: IWL errors when reading PCI config through /sys
Date: Mon, 4 Nov 2024 17:33:01 -0600
Message-ID: <20241104233301.GA1433525@bhelgaas>
In-Reply-To: <117b71d187214442cbcec407a618ff546e5d4386.camel@xoores.cz>
[+cc Emmanuel]
On Sun, Nov 03, 2024 at 01:52:24PM +0100, Jan Šídlo wrote:
> ...
Thanks for the report! Also, thanks for mentioning the bugzilla
report here on the mailing list, since most subsystems don't actively
monitor bugzilla.
> I'm trying to hunt down a few issues with my new-ish HP ZBook not
> wanting to go to deeper C-states, which is kind of painful for a
> laptop (battery drain is ~5-10%/hour). For this I created a little
> python script that gathers all the info about all the components
> from the system and periodically reports the status (every 3s or so)
> including PCI and USB devices. To gather some information
> (specifically about ASPM) I'm reading /config file for each PCI
> device in /sys device tree and parsing it. I'm not reading only
> /config but it is a prime suspect, because I excluded WLAN card from
> this reading routine and the crash took much longer to occur - hours
> instead of minutes.
>
> When I run this script, the IWL subsystem crashes after some time
> (minutes to hours). There is clearly something going on the PCI bus
> that I don't really understand. Since the error I get from IWL is
> changing, I suspect there is some kind of race condition that is
> triggered by my script. I opened a bug [1] and after some back and
> forth with Emmanuel Grumbach, he said that this kind of error is
> caused by IWL not being able to talk to the WLAN device (at all) and
> to try to get your opinion on the matter :)
It *should* be safe to read "config" from sysfs at any time, and also
to write to the ASPM "policy" module parameter file at any time, but
there could be bugs there.
When you say "crash", I guess you mean all the iwlwifi error logging
and the WARN_ON() stacktraces, right? I don't see an actual oops or
panic in the logs yet.
I assume none of these happen unless you are running your script or
writing the "policy" parameter? Does the problem happen if you *only*
run your script to scrape the info from "config"? What about if you
*only* update the "policy" parameter?
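To separate the two suspects, something like the following could be used.
This is only a rough sketch, not a tested reproducer: it assumes the usual
sysfs locations (/sys/bus/pci/devices/<BDF>/config and
/sys/module/pcie_aspm/parameters/policy), and the function names, the
"powersave" default and the 3s delay are made up for illustration.

```python
import time
from pathlib import Path

# Assumed standard sysfs locations; adjust if your system differs.
SYSFS_PCI = Path("/sys/bus/pci/devices")
ASPM_POLICY = Path("/sys/module/pcie_aspm/parameters/policy")


def read_config(bdf, root=SYSFS_PCI):
    """Return the raw config space bytes sysfs exposes for one device."""
    return (Path(root) / bdf / "config").read_bytes()


def current_policy(text):
    """Pick the bracketed entry out of e.g. '[default] performance powersave'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def run(mode, bdf, policy="powersave", rounds=100, delay=3.0,
        root=SYSFS_PCI, policy_path=ASPM_POLICY):
    """Exercise only one suspect per run: mode is 'config' or 'policy'."""
    if mode not in ("config", "policy"):
        raise ValueError(f"unknown mode: {mode}")
    for i in range(rounds):
        if mode == "config":
            data = read_config(bdf, root)
            print(i, "config bytes read:", len(data))
        else:
            # Writing the module parameter needs root.
            Path(policy_path).write_text(policy)
            print(i, "policy now:", current_policy(Path(policy_path).read_text()))
        time.sleep(delay)
```

For example, run("config", "0000:04:00.0") in one session and
run("policy", "0000:04:00.0") in another, watching dmesg for the iwlwifi
errors in each case.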
Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0:
0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the
device are failing so we get ~0 data. I'm guessing those come from a
BAR, so the BAR could be disabled or the device might not be
responding, e.g., because it is in a low-power state (D1, D2, D3hot,
D3cold) or being reset.
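Since the script already reads "config", it could also log the bits that
bear on these possibilities. A minimal sketch, assuming the standard PCI
config layout (Command at 0x04, Status at 0x06, Capabilities Pointer at
0x34, PMCSR at offset 4 of the Power Management capability); summarize()
is a hypothetical helper, not part of any existing tool:

```python
import struct

PCI_CAPABILITY_LIST = 0x34      # offset of the first capability pointer
PCI_COMMAND_MEMORY = 0x2        # Command: Memory Space Enable (BAR decoding)
PCI_STATUS_CAP_LIST = 0x10      # Status: capability list present
PCI_CAP_ID_PM = 0x01            # Power Management capability ID
PCI_PM_CTRL = 4                 # PMCSR offset within the PM capability
PCI_PM_CTRL_STATE_MASK = 0x3    # PMCSR bits 1:0 hold the D-state
POWER_STATES = ("D0", "D1", "D2", "D3hot")


def summarize(cfg):
    """Decode a few fields from raw config space bytes (as read from sysfs)."""
    if len(cfg) < 64:
        raise ValueError("need at least the 64-byte standard header")
    vendor, _device, command, status = struct.unpack_from("<HHHH", cfg, 0)
    if vendor == 0xFFFF:
        # All-ones vendor ID: the device did not answer the config read
        # (e.g. D3cold, in reset, or gone).
        return {"responding": False}
    info = {
        "responding": True,
        "mem_enabled": bool(command & PCI_COMMAND_MEMORY),
        "power_state": None,
    }
    if not status & PCI_STATUS_CAP_LIST:
        return info
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = 0
    # Unprivileged readers only get the first 64 bytes, so the capability
    # list may be out of reach; in that case power_state stays None.
    while pos >= 0x40 and pos + PCI_PM_CTRL + 2 <= len(cfg) and seen < 48:
        cap_id, nxt = cfg[pos], cfg[pos + 1]
        if cap_id == PCI_CAP_ID_PM:
            (pmcsr,) = struct.unpack_from("<H", cfg, pos + PCI_PM_CTRL)
            info["power_state"] = POWER_STATES[pmcsr & PCI_PM_CTRL_STATE_MASK]
            break
        pos = nxt & 0xFC
        seen += 1
    return info
```

If "mem_enabled" is False or "power_state" is not D0 around the time the
iwlwifi errors start, that would fit the ~0 reads from the BAR.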
I don't know whether iwlwifi checks for any PCIe failures like this.
I see iwl_trans_is_hw_error_value(), but that must be for some
iwlwifi-specific error thing, not for PCIe errors, because it checks
for things like 0xa5a5a5a0. For PCIe errors, we would see ~0
(0xffffffff).
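To illustrate the distinction, here is a rough Python sketch of the two
kinds of check. It is not the iwlwifi code: classify_read() is a made-up
name, and masking off the low nibble of the 0xa5a5a5a0 pattern is my
assumption about how such a check would work.

```python
PCIE_ERROR_RESPONSE = 0xFFFFFFFF   # what a failed 32-bit PCIe read returns
IWL_HW_ERROR_PATTERN = 0xA5A5A5A0  # driver-specific pattern mentioned above


def classify_read(val):
    """Classify a 32-bit value read from the device (illustrative only)."""
    val &= 0xFFFFFFFF
    if val == PCIE_ERROR_RESPONSE:
        # All ones can in principle be valid data, so this is only
        # "possibly" an error; the caller has to decide from context.
        return "possible-pcie-error"
    if (val & 0xFFFFFFF0) == IWL_HW_ERROR_PATTERN:
        # Assumption: the low nibble is ignored when matching the pattern.
        return "iwlwifi-hw-error"
    return "ok"
```

With that, the 0xFFFFFFFF values in the log would be flagged as possible
PCIe errors rather than as the device-reported error pattern.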
My guess is that all the other WARN()s and stacktraces are just a
consequence of trying to do things to a device that isn't responding.
> I have tried two different kernel versions (6.11.5 and 6.10.10), two
> different WLAN cards (BE200NGW and AX211NGW) and multiple versions
> of firmware for the cards. The error is still present, so I would
> say I'd need to dig deeper, but I'm not really familiar with PCI
> subsystem and how to debug it efficiently given the amount of data
> going through.
>
> What can I do to debug this issue further?
>
> 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457
Any clue if this is a regression? This seems like a common device, so
if the problem were widespread we should have lots of reports. The lack
of them would suggest something related to scraping the "config" file
or updating ASPM "policy" at runtime. So I'd say the first step is to
confirm that one or both of those is implicated.
Bjorn