From: Andrew Lunn <andrew@lunn.ch>
To: Eran Ben Elisha <eranbe@mellanox.com>
Cc: netdev@vger.kernel.org, Jiri Pirko <jiri@mellanox.com>,
Andy Gospodarek <andrew.gospodarek@broadcom.com>,
Michael Chan <michael.chan@broadcom.com>,
Jakub Kicinski <jakub.kicinski@netronome.com>,
Simon Horman <simon.horman@netronome.com>,
Alexander Duyck <alexander.duyck@gmail.com>,
Florian Fainelli <f.fainelli@gmail.com>,
Tal Alon <talal@mellanox.com>, Ariel Almog <ariela@mellanox.com>
Subject: Re: [RFC PATCH iproute2-next] man: Add devlink health man page
Date: Thu, 13 Sep 2018 15:24:53 +0200
Message-ID: <20180913132453.GE11702@lunn.ch>
In-Reply-To: <5c12253d-2100-09bb-9e3e-6259fc7a9323@mellanox.com>

On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
>
>
> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
> >> devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >> Sets TX_COMP_ERROR sensor parameters for a specific device.
> >
> >I hope the real sensors have more understandable names. If i remember
> >correctly, the same sort of comment was given for resource
> >management. It was pretty unclear what the resource names actually
> >mean. Is an average user going to have any idea how to actually use
> >these sensors and actions?
>
> well, hopefully. the whole point is to have it fully controlled by the user.
> However, names for the command should be short. I guess we shall have it
> documented (challenge is to fit to multi vendors).
>
> >
> >Can you give more examples of sensors. We should understand if there
> >are any overlaps with hwmon.
>
> I restate here that we shall have SW sensors as well, and not only HW
> sensors.
>
> This is what I had in mind:
> 1. command interface error
> 2. command interface timeout
> 3. stuck TX queue (like tx_timeout)
> 4. stuck TX completion queue (driver did not process packets in a reasonable
> time period)
> 5. stuck RX queue
> 6. RX completion error
> 7. TX completion error
> 8. HW / FW catastrophic error report
> 9. completion queue overrun

Hi Eran

I'm having trouble differentiating between these SW sensors and bugs
which need fixing. What causes a command interface error? Sending it a
command it does not understand? A wrongly formatted command? A command
the version of the firmware does not support? These all sound just
like plain old bugs which need fixing, not something which needs a
framework to detect them and try to recover from them by resetting
something.

I would have expected all of the issues to be about physical
properties, something similar to SMART for hard disks. The power
supplies are starting to droop, suggesting they might die soon. The
tacho on the fan suggests the fan is not rotating as fast as it
should, so the motor is going to die soon. An SFP is giving i2c
errors, suggesting it is not seated correctly. The card as a whole is
overheating, despite the fan working, suggesting the ambient
temperature is just too high.

	Andrew