From: Luis Chamberlain <mcgrof@kernel.org>
To: Brian Norris <briannorris@chromium.org>
Cc: aquini@redhat.com, peterz@infradead.org, daniel.vetter@ffwll.ch,
mchehab+samsung@kernel.org, will@kernel.org, schlad@suse.de,
bhe@redhat.com, ath10k@lists.infradead.org,
Takashi Iwai <tiwai@suse.de>,
mingo@redhat.com, dyoung@redhat.com, pmladek@suse.com,
Kees Cook <keescook@chromium.org>, Arnd Bergmann <arnd@arndb.de>,
gpiccoli@canonical.com, Steven Rostedt <rostedt@goodmis.org>,
cai@lca.pw, tglx@linutronix.de,
Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Kalle Valo <kvalo@codeaurora.org>,
"<netdev@vger.kernel.org>" <netdev@vger.kernel.org>,
linux-wireless <linux-wireless@vger.kernel.org>,
Linux Kernel <linux-kernel@vger.kernel.org>,
jeyu@kernel.org, Johannes Berg <johannes@sipsolutions.net>,
"David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
Date: Tue, 19 May 2020 14:02:12 +0000 [thread overview]
Message-ID: <20200519140212.GT11244@42.do-not-panic.com> (raw)
In-Reply-To: <CA+ASDXOg9oKeMJP1Mf42oCMMM3sVe0jniaWowbXVuaYZ=ZpDjQ@mail.gmail.com>
On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > detect that the device is really wedged enough that the only way we can
> > still try to recover is by completely unbinding the driver from it, then
> > we give userspace a uevent for that. I don't remember exactly how and
> > where that gets used (ChromeOS) though, but it'd be nice to have that
> > sort of thing as part of the infrastructure, in a sort of two-level
> > notification?
>
> <slight side track>
> We use this on certain devices where we know the underlying hardware
> has design issues that may lead to device failure
Ah, after reading below I see you meant for iwlwifi.
If userspace can indeed grow to support this, that would be fantastic.
I should note that I don't discourage hiding firmware or hardware
issues. Quite the contrary, I suspect that taking pride in being
trasnparent about it, and dealing with it fast can help lead the pack.
I wrote about this long ago in 2015 [0], and stand by it.
[0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html
> -- then when we see
> this sort of unrecoverable "firmware-death", we remove the
> device[*]+driver, force-reset the PCI device (SBR), and try to
> reload/reattach the driver. This all happens by way of a udev rule.
So you've sprikled your own udev event here as part of your kernel delta?
> We
> also log this sort of stuff (and metrics around it) for bug reports
> and health statistics, since we really hope to not see this happen
> often.
Assuming perfection is ideal but silly. So, what infrastructure do you
use for this sort of issue?
> [*] "We" (user space) don't actually do this...it happens via the
> 'remove_when_gone' module parameter abomination found in iwlwifi.
Holy moly.. but hey, at least it may seem a bit more seemless than forcing
a reboot / manual driver removal / addition to the user.
BTW is this likely a place on iwlwifi where the firmware likely crashed?
> I'd
> personally rather see the EVENT=INACESSIBLE stuff on its own, and let
> user space deal with when and how to remove and reset the device. But
> I digress too much here ;)
> </slight side track>
This is all useful information. We are just touching the surface of the
topic by addressing networking first. Imagine when we address other
subsystems.
> I really came to this thread to say that I also love the idea of a
> generic mechanism (a la $subject) to report firmware crashes, but I
> also have no interest in seeing a taint flag for it. For Chrome OS, I
> would readily (as in, we're already looking at more-hacky /
> non-generic ways to do this for drivers we care about) process these
> kinds of stats as they happen, logging metrics for bug reports and/or
> for automated crash statistics, when we see a firmware crash.
Great!
> A uevent
> would suit us very well I think, although it would be nice if drivers
> could also supply some small amount of informative text along with it
A follow up to this series was to add a uevent to add_taint(), however
since a *count* is not considered I think it is correct to seek
alternatives at this point. The leaner the solution the better though.
Do you have a pointer to what guys use so I can read?
> (e.g., a sort of "reason code", in case we can possibly aggregate
> certain failure types). We already do this sort of thing for WARN()
> and friends (not via uevent, but via log parsing; at least it has nice
> "cut here" markers!).
Indeed, similar things can indeed be argued about WARN*()... this
however can be non-device specific. With panic-on-warn becoming a
"thing", the more important it becomes to really tally exactly *why*
these WARN*()s may trigger.
> Perhaps
Note below.
> devlink (as proposed down-thread) would also fit the bill. I
> don't think sysfs alone would fit our needs, as we'd like to process
> these things as they happen, not only when a user submits a bug
> report.
I think we've reached a point where using "*Perhaps*" does not suffice,
and if there is already a *user* of similar desired infrastructure I
think we should jump on the opportunity to replace what you have with
something which could be used by other devices / subsystems which
require firmware. And indeed, also even consider in the abstract sense,
the possibility to leverage something like this for WARN*()s later too.
> > Level 1: firmware crashed, but we're recovering, at least mostly, and
> > it's more informational
>
> Chrome OS would love to track these things too, since we'd like to see
> these minimized, even if they're usually recoverable ;)
>
> > Level 2: device is wedged, going to try to recover by some more forceful
> > means (perhaps some devices can be power-cycled? etc.) but (more) state
> > would be lost in these cases?
>
> And we'd definitely want to know about these. We already get this for
> the iwlwifi case described above, in a non-generic way.
>
> In general, it's probably not that easy to tell the difference between
> 1 and 2, since even as you and Luis have noted, with the same driver
> (and the same driver location), you find the same crashes may or may
> not be recoverable. iwlwifi has extracted certain level 2 cases into
> iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
> ways in which level 1 crashes actually lead to severe
> (non-recoverable) failure.
And that is fine, accepting these for what they are will help. However,
leaving the user in the *dark*, is what we should *not do*.
Luis
_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k
WARNING: multiple messages have this Message-ID (diff)
From: Luis Chamberlain <mcgrof@kernel.org>
To: Brian Norris <briannorris@chromium.org>
Cc: Johannes Berg <johannes@sipsolutions.net>,
linux-wireless <linux-wireless@vger.kernel.org>,
aquini@redhat.com, peterz@infradead.org, daniel.vetter@ffwll.ch,
mchehab+samsung@kernel.org, will@kernel.org, bhe@redhat.com,
ath10k@lists.infradead.org, Takashi Iwai <tiwai@suse.de>,
mingo@redhat.com, dyoung@redhat.com, pmladek@suse.com,
Kees Cook <keescook@chromium.org>, Arnd Bergmann <arnd@arndb.de>,
gpiccoli@canonical.com, Steven Rostedt <rostedt@goodmis.org>,
cai@lca.pw, tglx@linutronix.de,
Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
Kalle Valo <kvalo@codeaurora.org>,
"<netdev@vger.kernel.org>" <netdev@vger.kernel.org>,
schlad@suse.de, Linux Kernel <linux-kernel@vger.kernel.org>,
jeyu@kernel.org, Andrew Morton <akpm@linux-foundation.org>,
"David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
Date: Tue, 19 May 2020 14:02:12 +0000 [thread overview]
Message-ID: <20200519140212.GT11244@42.do-not-panic.com> (raw)
In-Reply-To: <CA+ASDXOg9oKeMJP1Mf42oCMMM3sVe0jniaWowbXVuaYZ=ZpDjQ@mail.gmail.com>
On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > detect that the device is really wedged enough that the only way we can
> > still try to recover is by completely unbinding the driver from it, then
> > we give userspace a uevent for that. I don't remember exactly how and
> > where that gets used (ChromeOS) though, but it'd be nice to have that
> > sort of thing as part of the infrastructure, in a sort of two-level
> > notification?
>
> <slight side track>
> We use this on certain devices where we know the underlying hardware
> has design issues that may lead to device failure
Ah, after reading below I see you meant for iwlwifi.
If userspace can indeed grow to support this, that would be fantastic.
I should note that I don't discourage hiding firmware or hardware
issues. Quite the contrary, I suspect that taking pride in being
trasnparent about it, and dealing with it fast can help lead the pack.
I wrote about this long ago in 2015 [0], and stand by it.
[0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html
> -- then when we see
> this sort of unrecoverable "firmware-death", we remove the
> device[*]+driver, force-reset the PCI device (SBR), and try to
> reload/reattach the driver. This all happens by way of a udev rule.
So you've sprikled your own udev event here as part of your kernel delta?
> We
> also log this sort of stuff (and metrics around it) for bug reports
> and health statistics, since we really hope to not see this happen
> often.
Assuming perfection is ideal but silly. So, what infrastructure do you
use for this sort of issue?
> [*] "We" (user space) don't actually do this...it happens via the
> 'remove_when_gone' module parameter abomination found in iwlwifi.
Holy moly.. but hey, at least it may seem a bit more seemless than forcing
a reboot / manual driver removal / addition to the user.
BTW is this likely a place on iwlwifi where the firmware likely crashed?
> I'd
> personally rather see the EVENT=INACESSIBLE stuff on its own, and let
> user space deal with when and how to remove and reset the device. But
> I digress too much here ;)
> </slight side track>
This is all useful information. We are just touching the surface of the
topic by addressing networking first. Imagine when we address other
subsystems.
> I really came to this thread to say that I also love the idea of a
> generic mechanism (a la $subject) to report firmware crashes, but I
> also have no interest in seeing a taint flag for it. For Chrome OS, I
> would readily (as in, we're already looking at more-hacky /
> non-generic ways to do this for drivers we care about) process these
> kinds of stats as they happen, logging metrics for bug reports and/or
> for automated crash statistics, when we see a firmware crash.
Great!
> A uevent
> would suit us very well I think, although it would be nice if drivers
> could also supply some small amount of informative text along with it
A follow up to this series was to add a uevent to add_taint(), however
since a *count* is not considered I think it is correct to seek
alternatives at this point. The leaner the solution the better though.
Do you have a pointer to what guys use so I can read?
> (e.g., a sort of "reason code", in case we can possibly aggregate
> certain failure types). We already do this sort of thing for WARN()
> and friends (not via uevent, but via log parsing; at least it has nice
> "cut here" markers!).
Indeed, similar things can indeed be argued about WARN*()... this
however can be non-device specific. With panic-on-warn becoming a
"thing", the more important it becomes to really tally exactly *why*
these WARN*()s may trigger.
> Perhaps
Note below.
> devlink (as proposed down-thread) would also fit the bill. I
> don't think sysfs alone would fit our needs, as we'd like to process
> these things as they happen, not only when a user submits a bug
> report.
I think we've reached a point where using "*Perhaps*" does not suffice,
and if there is already a *user* of similar desired infrastructure I
think we should jump on the opportunity to replace what you have with
something which could be used by other devices / subsystems which
require firmware. And indeed, also even consider in the abstract sense,
the possibility to leverage something like this for WARN*()s later too.
> > Level 1: firmware crashed, but we're recovering, at least mostly, and
> > it's more informational
>
> Chrome OS would love to track these things too, since we'd like to see
> these minimized, even if they're usually recoverable ;)
>
> > Level 2: device is wedged, going to try to recover by some more forceful
> > means (perhaps some devices can be power-cycled? etc.) but (more) state
> > would be lost in these cases?
>
> And we'd definitely want to know about these. We already get this for
> the iwlwifi case described above, in a non-generic way.
>
> In general, it's probably not that easy to tell the difference between
> 1 and 2, since even as you and Luis have noted, with the same driver
> (and the same driver location), you find the same crashes may or may
> not be recoverable. iwlwifi has extracted certain level 2 cases into
> iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
> ways in which level 1 crashes actually lead to severe
> (non-recoverable) failure.
And that is fine, accepting these for what they are will help. However,
leaving the user in the *dark*, is what we should *not do*.
Luis
next prev parent reply other threads:[~2020-05-19 14:02 UTC|newest]
Thread overview: 128+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-15 21:28 [PATCH v2 00/15] net: taint when the device driver firmware crashes Luis Chamberlain
2020-05-15 21:28 ` [PATCH v2 01/15] taint: add module firmware crash taint support Luis Chamberlain
2020-05-16 4:03 ` Rafael Aquini
2020-05-19 16:42 ` Jessica Yu
2020-05-22 5:17 ` Luis Chamberlain
2020-05-15 21:28 ` [PATCH v2 02/15] ethernet/839: use new module_firmware_crashed() Luis Chamberlain
2020-05-16 4:04 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 03/15] bnx2x: " Luis Chamberlain
2020-05-16 4:05 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 04/15] bnxt: " Luis Chamberlain
2020-05-16 4:06 ` Rafael Aquini
2020-05-16 5:14 ` Vasundhara Volam
2020-05-15 21:28 ` [PATCH v2 05/15] bna: " Luis Chamberlain
2020-05-16 4:07 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 06/15] liquidio: " Luis Chamberlain
2020-05-16 4:07 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 07/15] cxgb4: " Luis Chamberlain
2020-05-16 4:09 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 08/15] ehea: " Luis Chamberlain
2020-05-16 4:09 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 09/15] qed: " Luis Chamberlain
2020-05-16 4:10 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 10/15] soc: qcom: ipa: " Luis Chamberlain
2020-05-16 4:10 ` Rafael Aquini
2020-05-19 22:34 ` Alex Elder
2020-05-22 5:28 ` Luis Chamberlain
2020-05-22 20:52 ` Alex Elder
2020-05-22 21:53 ` Luis Chamberlain
2020-05-15 21:28 ` [PATCH v2 11/15] wimax/i2400m: " Luis Chamberlain
2020-05-16 4:11 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: " Luis Chamberlain
2020-05-15 21:28 ` Luis Chamberlain
2020-05-16 4:11 ` Rafael Aquini
2020-05-16 4:11 ` Rafael Aquini
2020-05-16 13:24 ` Johannes Berg
2020-05-16 13:24 ` Johannes Berg
2020-05-16 13:50 ` Johannes Berg
2020-05-16 13:50 ` Johannes Berg
2020-05-18 16:56 ` Luis Chamberlain
2020-05-18 16:56 ` Luis Chamberlain
2020-05-19 1:23 ` Brian Norris
2020-05-19 1:23 ` Brian Norris
2020-05-19 14:02 ` Luis Chamberlain [this message]
2020-05-19 14:02 ` Luis Chamberlain
2020-05-20 0:47 ` Brian Norris
2020-05-20 0:47 ` Brian Norris
2020-05-20 5:37 ` Emmanuel Grumbach
2020-05-20 5:37 ` Emmanuel Grumbach
2020-05-20 8:32 ` Andy Shevchenko
2020-05-20 8:32 ` Andy Shevchenko
2020-05-21 19:01 ` Brian Norris
2020-05-21 19:01 ` Brian Norris
2020-05-22 5:12 ` Emmanuel Grumbach
2020-05-22 5:12 ` Emmanuel Grumbach
2020-05-22 5:23 ` Luis Chamberlain
2020-05-22 5:23 ` Luis Chamberlain
2020-05-18 16:51 ` Luis Chamberlain
2020-05-18 16:51 ` Luis Chamberlain
2020-05-18 16:58 ` Ben Greear
2020-05-18 16:58 ` Ben Greear
2020-05-18 17:09 ` Luis Chamberlain
2020-05-18 17:09 ` Luis Chamberlain
2020-05-18 17:15 ` Ben Greear
2020-05-18 17:15 ` Ben Greear
2020-05-18 17:18 ` Luis Chamberlain
2020-05-18 17:18 ` Luis Chamberlain
2020-05-18 18:06 ` Steve deRosier
2020-05-18 18:06 ` Steve deRosier
2020-05-18 19:09 ` Luis Chamberlain
2020-05-18 19:09 ` Luis Chamberlain
2020-05-18 19:25 ` Johannes Berg
2020-05-18 19:25 ` Johannes Berg
2020-05-18 19:59 ` Luis Chamberlain
2020-05-18 19:59 ` Luis Chamberlain
2020-05-18 20:07 ` Johannes Berg
2020-05-18 20:07 ` Johannes Berg
2020-05-18 21:18 ` Luis Chamberlain
2020-05-18 21:18 ` Luis Chamberlain
2020-05-18 20:28 ` Jakub Kicinski
2020-05-18 20:28 ` Jakub Kicinski
2020-05-18 20:29 ` Johannes Berg
2020-05-18 20:29 ` Johannes Berg
2020-05-18 20:35 ` Jakub Kicinski
2020-05-18 20:35 ` Jakub Kicinski
2020-05-18 20:41 ` Johannes Berg
2020-05-18 20:41 ` Johannes Berg
2020-05-18 20:46 ` Jakub Kicinski
2020-05-18 20:46 ` Jakub Kicinski
2020-05-18 21:22 ` Luis Chamberlain
2020-05-18 21:22 ` Luis Chamberlain
2020-05-18 22:16 ` Jakub Kicinski
2020-05-18 22:16 ` Jakub Kicinski
2020-05-19 1:05 ` Luis Chamberlain
2020-05-19 1:05 ` Luis Chamberlain
2020-05-19 21:15 ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski
2020-05-19 21:15 ` Jakub Kicinski
2020-05-22 5:20 ` Luis Chamberlain
2020-05-22 5:20 ` Luis Chamberlain
2020-05-22 17:17 ` Jakub Kicinski
2020-05-22 17:17 ` Jakub Kicinski
2020-05-22 20:46 ` Johannes Berg
2020-05-22 20:46 ` Johannes Berg
2020-05-22 21:51 ` Luis Chamberlain
2020-05-22 21:51 ` Luis Chamberlain
2020-05-22 23:23 ` Steve deRosier
2020-05-22 23:23 ` Steve deRosier
2020-05-22 23:44 ` Luis Chamberlain
2020-05-22 23:44 ` Luis Chamberlain
2020-05-25 9:07 ` Andy Shevchenko
2020-05-25 9:07 ` Andy Shevchenko
2020-05-25 17:08 ` Ben Greear
2020-05-25 17:08 ` Ben Greear
2020-05-25 20:57 ` Jakub Kicinski
2020-05-25 20:57 ` Jakub Kicinski
2020-07-30 13:56 ` Johannes Berg
2020-07-30 13:56 ` Johannes Berg
2020-05-22 21:49 ` Luis Chamberlain
2020-05-22 21:49 ` Luis Chamberlain
2020-05-19 21:15 ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski
2020-05-19 21:15 ` Jakub Kicinski
2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain
2020-05-15 21:28 ` Luis Chamberlain
2020-05-16 4:12 ` Rafael Aquini
2020-05-16 4:12 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain
2020-05-16 4:13 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain
2020-05-16 4:13 ` Rafael Aquini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200519140212.GT11244@42.do-not-panic.com \
--to=mcgrof@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=andriy.shevchenko@linux.intel.com \
--cc=aquini@redhat.com \
--cc=arnd@arndb.de \
--cc=ath10k@lists.infradead.org \
--cc=bhe@redhat.com \
--cc=briannorris@chromium.org \
--cc=cai@lca.pw \
--cc=daniel.vetter@ffwll.ch \
--cc=davem@davemloft.net \
--cc=dyoung@redhat.com \
--cc=gpiccoli@canonical.com \
--cc=jeyu@kernel.org \
--cc=johannes@sipsolutions.net \
--cc=keescook@chromium.org \
--cc=kvalo@codeaurora.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-wireless@vger.kernel.org \
--cc=mchehab+samsung@kernel.org \
--cc=mingo@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=pmladek@suse.com \
--cc=rostedt@goodmis.org \
--cc=schlad@suse.de \
--cc=tglx@linutronix.de \
--cc=tiwai@suse.de \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.