From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Eran Ben Elisha <eranbe@mellanox.com>
Cc: netdev@vger.kernel.org, Jiri Pirko <jiri@mellanox.com>,
Andy Gospodarek <andrew.gospodarek@broadcom.com>,
Michael Chan <michael.chan@broadcom.com>,
Simon Horman <simon.horman@netronome.com>,
Alexander Duyck <alexander.duyck@gmail.com>,
Andrew Lunn <andrew@lunn.ch>,
Florian Fainelli <f.fainelli@gmail.com>,
Tal Alon <talal@mellanox.com>, Ariel Almog <ariela@mellanox.com>
Subject: Re: [RFC PATCH iproute2-next] System specification health API
Date: Thu, 13 Sep 2018 10:36:04 -0700
Message-ID: <20180913103604.0ef868f4@cakuba.netronome.com>
In-Reply-To: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>
On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted for Real Time Alerting, in order to know when
> something bad had happened to a PCI device
By spec, do you mean some standards body spec that you implement, or is
this proposal itself the spec?
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed debugging
> information.
>
> The health contains sensors which sense for malfunction. Once sensor triggered,
> actions such as logs and correction can be taken.
> Sensors are sensing the health state and can trigger correction action.
>
> The sensors are divided into the following groups
> - Hardware sensor - a sensor which is triggered by the device due to
> malfunction.
> - Software sensor - a sensor which is triggered by the software due to
> malfunction.
> Both group of sensors can be triggered due to error event or due to a periodic check.
>
> Actions are the way to handle sensor events. Action can be in one of the
> following groups:
> - Dump - SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> Actions can be performed by SW or HW.
>
> User is allowed to enable or disable sensors and sensor2action mapping.
>
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.
I like the idea of configuring the response to events like this, although
I'm not sure the name "sensor" is appropriate here - perhaps "exception" or
"error" would be better? Are there going to be values reported?

I'm not so sure about HW sensors in relation to the existing HWMON
infrastructure... I assume you're targeting things like, say, some HW
engine/block reporting that it encountered an error? Sounds good, too.

Are the actions all envisioned to be performed by the driver?
Firmware? Hardware? I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of
the setting, since persistence was only implemented for params :S

Is the dump option going to tie back into region snapshots?
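
For illustration only, here is a minimal C sketch of the sensor/action
model as described in the quoted proposal: sensors are either HW or SW,
each can be enabled or disabled, and each carries a sensor2action mapping
to dump and/or reset. Every name below (health_sensor, sensor_map_action,
sensor_trigger, the enums) is hypothetical and is not part of devlink or
any existing kernel API.

```c
#include <stdbool.h>

/* Hypothetical types: an illustration of the proposal, not kernel code. */
enum sensor_source { SENSOR_HW, SENSOR_SW };

/* Action groups named in the proposal, encoded as bit flags. */
enum action_kind {
	ACTION_DUMP  = 1 << 0,	/* SW/HW trace or dump */
	ACTION_RESET = 1 << 1,	/* surgical correction, e.g. flush a queue */
};

struct health_sensor {
	const char *name;
	enum sensor_source source;
	bool enabled;		/* user can enable or disable the sensor */
	unsigned int actions;	/* sensor2action mapping, bitmask of action_kind */
};

/* Enable or disable one action in a sensor's sensor2action mapping. */
static void sensor_map_action(struct health_sensor *s,
			      enum action_kind action, bool on)
{
	if (on)
		s->actions |= (unsigned int)action;
	else
		s->actions &= ~(unsigned int)action;
}

/*
 * Return the bitmask of actions to run when the sensor fires.
 * A disabled sensor triggers nothing, whatever its mapping says.
 */
static unsigned int sensor_trigger(const struct health_sensor *s)
{
	return s->enabled ? s->actions : 0;
}
```

In this sketch, whether an action is carried out by the driver, firmware
or hardware is deliberately left out, matching the open question above.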