From: Eran Ben Elisha <eranbe@nvidia.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: <netdev@vger.kernel.org>, <jiri@resnulli.us>, <saeedm@nvidia.com>,
	<andrew.gospodarek@broadcom.com>, <jacob.e.keller@intel.com>,
	<guglielmo.morandin@broadcom.com>, <eugenem@fb.com>,
	<eranbe@mellanox.com>
Subject: Re: [RFC] devlink: health: add remediation type
Date: Tue, 9 Mar 2021 16:06:49 +0200
Message-ID: <bca3440c-9279-58a6-377f-6a4fdcccdf1f@nvidia.com>
In-Reply-To: <20210308095950.3cede742@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>



On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
> On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:
>>>> +	DLH_REMEDY_BAD_PART,
>>> BAD_PART probably indicates that the reporter (or any command line
>>> execution) cannot recover the issue.
>>> As the suggested remedy is static per reporter's recover method, it
>>> doesn't make sense for one to set a recover method that by design cannot
>>> recover successfully.
>>>
>>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,
>>> REIMAGE and BAD_PART? To indicate to the user that for a successful
>>> recovery, it should run a non-devlink-health operation?
>>
>> Hm, export and extend devlink_health_reporter_state? I like that idea.
> 
> Trying to type it up it looks less pretty than expected.
> 
> Let's look at some examples.
> 
> A queue reporter, say "rx", resets the queue dropping all outstanding
> buffers. As previously mentioned when the normal remediation fails user
> is expected to power cycle the machine or maybe swap the card. The
> device itself does not have a crystal ball.

Not sure. Reopening the queue or reinitializing the driver might also
help, for example when the issue is in the SW/HW queue context. But I
agree that the RX reporter can't tell, from its own perspective, what
further escalation is needed when its locally defined operations fail.

> 
> A management FW reporter "fw", has an auto recovery of FW reset
> (REMEDY_RESET). On failure -> power cycle.
> 
> An "io" reporter (PCI link had to be trained down) can only return
> a hardware failure (we should probably have a HW failure other than
> BAD_PART for this).
> 
> Flash reporters - the device will know if the flash had a bad block
> or the entire part is bad, so probably can have 2 reporters for this.
> 
> Most of the reporters would only report one "action" that can be
> performed to fix them. The cartesian product of ->recovery types vs
> manual recovery does not seem necessary. And drivers would get bloated
> with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
> _all_ cases with ->recovery. Because what else would the fix be if
> software-initiated reset didn't work?
> 

OK, I see your point.

If I got you right, these are the conclusions so far:
1. Each reporter with a recover callback will have to supply a remedy
definition.
2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as remedies,
because these are not valid reporter recover flows in any case.
3. If a reporter fails to recover, its state shall remain error, and it
is out of the reporter's scope to advise the administrator on further
actions.
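
A rough user-space sketch of the three points above, only to illustrate
the intended semantics. It is not devlink code: every sketch_* name is
hypothetical, the REINIT_QUEUE remedy is an assumption made for the
example, and only the idea of a RESET remedy comes from this thread.

```c
/* Hypothetical remedy types. Per point 2 there is deliberately no
 * POWER_CYCLE, REIMAGE or BAD_PART entry here. */
enum sketch_remedy {
	SKETCH_REMEDY_NONE,
	SKETCH_REMEDY_REINIT_QUEUE,	/* assumption, for illustration */
	SKETCH_REMEDY_RESET,
};

enum sketch_state {
	SKETCH_STATE_HEALTHY,
	SKETCH_STATE_ERROR,
};

struct sketch_reporter {
	const char *name;
	/* Static description of what recover() does. */
	enum sketch_remedy remedy;
	int (*recover)(struct sketch_reporter *r);
	enum sketch_state state;
};

/* Point 1: a reporter with a recover callback must declare a remedy. */
static int sketch_reporter_validate(const struct sketch_reporter *r)
{
	if (r->recover && r->remedy == SKETCH_REMEDY_NONE)
		return -1;
	return 0;
}

/* Point 3: on a failed recovery the state simply stays in error; the
 * reporter gives no advice about further escalation. */
static int sketch_reporter_report(struct sketch_reporter *r)
{
	int err;

	r->state = SKETCH_STATE_ERROR;
	if (!r->recover)
		return -1;

	err = r->recover(r);
	if (err)
		return err;

	r->state = SKETCH_STATE_HEALTHY;
	return 0;
}

/* Stub callbacks standing in for real driver recovery flows. */
static int sketch_recover_ok(struct sketch_reporter *r)
{
	(void)r;
	return 0;
}

static int sketch_recover_fail(struct sketch_reporter *r)
{
	(void)r;
	return -1;
}
```

With these stubs, an "fw" reporter using sketch_recover_ok ends up
healthy after a report, while an "rx" reporter using sketch_recover_fail
stays in error and leaves the next step to the administrator.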
