netdev.vger.kernel.org archive mirror
* [RFC PATCH iproute2-next V2] System specification exception API
@ 2018-09-26 11:52 Eran Ben Elisha
  2018-09-26 11:52 ` [RFC PATCH iproute2-next V2] man: Add devlink exception man page Eran Ben Elisha
  2018-09-27 12:47 ` [RFC PATCH iproute2-next V2] System specification exception API Jiri Pirko
  0 siblings, 2 replies; 8+ messages in thread
From: Eran Ben Elisha @ 2018-09-26 11:52 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding
  Cc: Ariel Almog, Tal Alon, Eran Ben Elisha

The exception spec targets real-time alerting: knowing when something bad
has happened to a PCI device. Its goals are to:
- Provide alert debug information
- Allow self-healing
- Provide a way to gather all needed debugging information when a problem
  needs vendor support.

The exception mechanism contains condition checkers which sense malfunctions.
When a condition is hit, actions such as logging and correction can be taken.

The condition checkers are divided into the following groups:
- Hardware - a checker which is triggered by the device due to a
  malfunction.
- Software - a checker which is triggered by the software due to a
  malfunction.
Checkers in both groups can be triggered either by an error event or by a periodic check.

Actions are the way to handle those events. An action belongs to one of the
following groups:
- Dump - SW trace, SW dump, HW trace, HW dump
- Reset - Surgical correction (e.g. modify queue, flush queue, reset of device, etc.)
Actions can be performed by SW or HW.

The user can enable or disable condition checkers and their action mappings.

This RFC man page patch describes the proposed devlink-exception API for
controlling conditions and actions.
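
As a rough illustration of the model described above (not part of the patch),
the sketch below represents a condition checker with its group and a
per-action active/inactive map. The class name, field names and the on_hit()
helper are hypothetical and exist only for this example; TX_COMP_ERROR, dump
and reset are taken from the man page examples.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    """A condition checker and its action mapping (hypothetical model)."""
    name: str
    group: str                  # "hw" or "sw", as in the groups listed above
    enabled: bool = True        # a disabled checker triggers nothing
    actions: dict = field(default_factory=dict)  # action name -> active?

    def set_action(self, action: str, active: bool) -> None:
        """Mark one action as active or inactive for this condition."""
        self.actions[action] = active

    def on_hit(self) -> list:
        """Return the actions to perform when the condition is hit."""
        if not self.enabled:
            return []
        return [name for name, active in self.actions.items() if active]


cond = Condition("TX_COMP_ERROR", group="sw")
cond.set_action("dump", True)     # dump is active for this condition
cond.set_action("reset", False)   # reset is mapped but inactive
triggered = cond.on_hit()
```

Keeping the mapping per condition mirrors the "condition set ... action NAME"
form of the proposed command; whether a driver would store it this way is an
assumption.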

V2:
* Renaming terms:
	health -> exception
	sensor -> condition
* Remove reinit command and merge with action command.
* Cosmetic grammar fixes.

Eran Ben Elisha (1):
  man: Add devlink exception man page

 man/man8/devlink-exception.8 | 158 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 man/man8/devlink-exception.8

-- 
1.8.3.1


* [RFC PATCH iproute2-next V2] man: Add devlink exception man page
  2018-09-26 11:52 [RFC PATCH iproute2-next V2] System specification exception API Eran Ben Elisha
@ 2018-09-26 11:52 ` Eran Ben Elisha
  2018-09-27 14:32   ` Jiri Pirko
  2018-09-27 12:47 ` [RFC PATCH iproute2-next V2] System specification exception API Jiri Pirko
  1 sibling, 1 reply; 8+ messages in thread
From: Eran Ben Elisha @ 2018-09-26 11:52 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding
  Cc: Ariel Almog, Tal Alon, Eran Ben Elisha

Add the devlink-exception man page. The devlink-exception tool controls device
exception attributes, conditions, actions and logging.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>

-------------------------------------------------------
The rendered man output is pasted here to make the RFC easier to review.

DEVLINK-EXCEPTION(8)                                                                                            Linux                                                                                           DEVLINK-EXCEPTION(8)

NAME
       devlink-exception - devlink exception configuration

SYNOPSIS
       devlink [ OPTIONS ] exception  { COMMAND | help }

       OPTIONS := { -V[ersion] | -n[o-nice-names] }

       devlink exception show [ DEV ] [ condition NAME ] [ action NAME ]

       devlink exception condition set DEV name NAME [ action NAME { active | inactive } ]

       devlink exception action set DEV name NAME period PERIOD count COUNT fail { ignore | down }

       devlink exception help

DESCRIPTION
       devlink-exception tool lets the user configure how the driver treats unexpected status. It configures the conditions that can trigger exception activity and, for each condition, the follow-up
       operations, such as reset and dump of information. In addition, it sets the exception activity termination action.

   devlink exception show - displays devlink exception condition and action attributes
       DEV    Specifies the devlink device to show.

       condition NAME
              Specifies the devlink condition to show.

       action NAME
              Specifies the devlink action to show.

   devlink exception condition set - sets devlink exception condition attributes
       DEV    Specifies the devlink device to set.

       name NAME
              Name of the condition to set.

       action NAME { active | inactive }
                  Specify which actions to activate and which to deactivate once the condition is triggered. Actions can be dump, reset, etc.

   devlink exception action set - sets devlink action attributes.
       Once this command is launched, period and count measurement will be reset.

       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

       period PERIOD
              The period, in seconds, over which the number of performed actions is limited.

       count COUNT
              The maximum number of times the action is performed within one period.

       fail   { ignore | down }
                  Specify the behavior once the count limit is reached.

                  ignore - Skip triggering this action.

                  down - The driver remains in a nonoperational state.

EXAMPLES
       devlink exception show
           Shows the exception state of all devlink devices on the system.

       devlink exception show pci/0000:01:00.0
           Shows the exception state of the specified devlink device.

       devlink exception condition set pci/0000:01:00.0 name TX_COMP_ERROR action reset inactive action dump active
           Sets TX_COMP_ERROR condition parameters for a specific device.

       devlink exception action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
           Sets exception attributes for the reset action. The period timer and counter are reset.

SEE ALSO
       devlink(8), devlink-port(8), devlink-sb(8), devlink-monitor(8), devlink-dev(8)

AUTHOR
       Eran ben Elisha <eranbe@mellanox.com>

iproute2                                                                                                     15 Aug 2018                                                                                        DEVLINK-EXCEPTION(8)
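
For reviewers, the period/count/fail semantics of "devlink exception action
set" described above can be sketched as a small rate limiter. This is only an
illustration under stated assumptions, not driver code: the man page does not
say whether the period is a fixed or a sliding window (a fixed window is
assumed here), and it is assumed that "down" is sticky until the action is
reconfigured. The ActionLimiter class and its request() method are
hypothetical names.

```python
class ActionLimiter:
    """Limit an action to `count` runs per `period` seconds (fixed window)."""

    def __init__(self, period, count, fail, now=0.0):
        if fail not in ("ignore", "down"):
            raise ValueError("fail must be 'ignore' or 'down'")
        self.period = period
        self.count = count
        self.fail = fail
        # Creating the limiter models "action set": timer and counter restart.
        self.window_start = now
        self.performed = 0
        self.is_down = False

    def request(self, now):
        """Return 'run', 'ignore' or 'down' for an action requested at `now`."""
        if self.is_down:
            return "down"              # assumption: stays down until reconfigured
        if now - self.window_start >= self.period:
            self.window_start = now    # a new period begins
            self.performed = 0
        if self.performed < self.count:
            self.performed += 1
            return "run"
        if self.fail == "down":
            self.is_down = True
            return "down"
        return "ignore"


# Mirrors the last example: reset, period 3600, count 5, fail ignore.
limiter = ActionLimiter(period=3600, count=5, fail="ignore")
results = [limiter.request(t) for t in range(7)]
after_period = limiter.request(3600)
```

With these settings the first five requests run, the next two are skipped, and
a request made once the period has elapsed runs again.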

---
 man/man8/devlink-exception.8 | 158 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 man/man8/devlink-exception.8

diff --git a/man/man8/devlink-exception.8 b/man/man8/devlink-exception.8
new file mode 100644
index 000000000000..03f24b32cc98
--- /dev/null
+++ b/man/man8/devlink-exception.8
@@ -0,0 +1,158 @@
+.TH DEVLINK\-EXCEPTION 8 "15 Aug 2018" "iproute2" "Linux"
+.SH NAME
+devlink-exception \- devlink exception configuration
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B devlink
+.RI "[ " OPTIONS " ]"
+.BR exception
+.RI  " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.IR OPTIONS " := { "
+\fB\-V\fR[\fIersion\fR] |
+\fB\-n\fR[\fIo-nice-names\fR] }
+
+.ti -8
+.B devlink exception show
+.RI "[ " DEV " ]"
+.RI "[ "
+.B condition
+.IR NAME
+.RI "]"
+.RI "[ "
+.B action
+.IR NAME
+.RI "]"
+
+.ti -8
+.B devlink exception condition set
+.IR DEV
+.B name
+.IR NAME
+.RI "[ "
+.BR action
+.IR NAME
+.R "{" active "|" inactive "}" ]
+
+.ti -8
+.B devlink exception action set
+.IR DEV
+.B name
+.IR NAME
+.BR period
+.IR PERIOD
+.BR count
+.IR COUNT
+.BR fail " { "
+.IR ignore
+.BR "| "
+.IR down
+.R "} "
+
+.ti -8
+.B devlink exception help
+
+.SH "DESCRIPTION"
+.B devlink-exception
+tool lets the user configure how the driver treats unexpected status. It configures the conditions that can trigger exception activity and, for each condition, the follow-up operations, such as reset and dump of information. In addition, it sets the exception activity termination action.
+
+.SS devlink exception show - displays devlink exception condition and action attributes
+.TP
+.BI "DEV"
+Specifies the devlink device to show.
+
+.PP
+.TP
+.BI condition " NAME"
+Specifies the devlink condition to show.
+
+.TP
+.BI action " NAME"
+Specifies the devlink action to show.
+
+.SS devlink exception condition set - sets devlink exception condition attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Name of the condition to set.
+
+.TP
+.BR action
+.IR NAME
+.R "{" active "|" inactive "} "
+.in +4
+Specify which actions to activate and which to deactivate once the condition is triggered. Actions can be dump, reset, etc.
+
+.SS devlink exception action set - sets devlink action attributes.
+Once this command is launched, period and count measurement will be reset.
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.TP
+.BI period " PERIOD"
+The period, in seconds, over which the number of performed actions is limited.
+
+.TP
+.BI count " COUNT"
+The maximum number of times the action is performed within one period.
+
+.TP
+.BR fail
+.R "{" ignore "|" down "}"
+.in +4
+Specify the behavior once the count limit is reached.
+
+.I ignore
+- Skip triggering this action.
+
+.I down
+- The driver remains in a nonoperational state.
+
+.SH "EXAMPLES"
+.PP
+devlink exception show
+.RS 4
+Shows the exception state of all devlink devices on the system.
+.RE
+.PP
+devlink exception show pci/0000:01:00.0
+.RS 4
+Shows the exception state of the specified devlink device.
+.RE
+.PP
+devlink exception condition set pci/0000:01:00.0 name TX_COMP_ERROR action reset inactive action dump active
+.RS 4
+Sets TX_COMP_ERROR condition parameters for a specific device.
+.RE
+.PP
+devlink exception action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
+.RS 4
+Sets exception attributes for the reset action. The period timer and counter are reset.
+.RE
+
+.SH SEE ALSO
+.BR devlink (8),
+.BR devlink-port (8),
+.BR devlink-sb (8),
+.BR devlink-monitor (8),
+.BR devlink-dev (8)
+.br
+
+.SH AUTHOR
+Eran ben Elisha <eranbe@mellanox.com>
-- 
1.8.3.1


* Re: [RFC PATCH iproute2-next V2] System specification exception API
  2018-09-26 11:52 [RFC PATCH iproute2-next V2] System specification exception API Eran Ben Elisha
  2018-09-26 11:52 ` [RFC PATCH iproute2-next V2] man: Add devlink exception man page Eran Ben Elisha
@ 2018-09-27 12:47 ` Jiri Pirko
  2018-09-27 14:02   ` Eran Ben Elisha
  1 sibling, 1 reply; 8+ messages in thread
From: Jiri Pirko @ 2018-09-27 12:47 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding, Ariel Almog, Tal Alon

Wed, Sep 26, 2018 at 01:52:58PM CEST, eranbe@mellanox.com wrote:
>The exception spec is targeted for Real Time Alerting, in order to know when
>something bad had happened to a PCI device
>- Provide alert debug information
>- Self healing
>- If problem needs vendor support, provide a way to gather all needed debugging
>  information.
>
>The exception mechanism contains condition checkers which sense for malfunction. Upon a condition hit,
>actions such as logs and correction can be taken.
>
>The condition checkers are divided into the following groups
>- Hardware - a checker which is triggered by the device due to
>  malfunction.
>- Software - a checker which is triggered by the software due to
>  malfunction.

What do you mean by a "software malfunction"? A "FW malfunction"?
Also, I don't see these 2 groups in the man page.



* Re: [RFC PATCH iproute2-next V2] System specification exception API
  2018-09-27 12:47 ` [RFC PATCH iproute2-next V2] System specification exception API Jiri Pirko
@ 2018-09-27 14:02   ` Eran Ben Elisha
  2018-09-27 14:34     ` Jiri Pirko
  0 siblings, 1 reply; 8+ messages in thread
From: Eran Ben Elisha @ 2018-09-27 14:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding, Ariel Almog, Tal Alon



On 9/27/2018 3:47 PM, Jiri Pirko wrote:
> Wed, Sep 26, 2018 at 01:52:58PM CEST, eranbe@mellanox.com wrote:
>> The exception spec is targeted for Real Time Alerting, in order to know when
>> something bad had happened to a PCI device
>> - Provide alert debug information
>> - Self healing
>> - If problem needs vendor support, provide a way to gather all needed debugging
>>   information.
>>
>> The exception mechanism contains condition checkers which sense for malfunction. Upon a condition hit,
>> actions such as logs and correction can be taken.
>>
>> The condition checkers are divided into the following groups
>> - Hardware - a checker which is triggered by the device due to
>>   malfunction.
>> - Software - a checker which is triggered by the software due to
>>   malfunction.
> 
> What do you mean by a "software malfunction", a "FW malfunction"?
> Also, I don't see this 2 groups in the man.

A software malfunction can be a transmit error (caused by a bad send request).
A FW/HW malfunction can be any catastrophic error report (the ones that
should be exposed to the driver).
The comment here was to highlight that we can support different kinds of
condition groups.
If we need to mark a specific condition as SW or HW, we can append that to
its name.

Eran


* Re: [RFC PATCH iproute2-next V2] man: Add devlink exception man page
  2018-09-26 11:52 ` [RFC PATCH iproute2-next V2] man: Add devlink exception man page Eran Ben Elisha
@ 2018-09-27 14:32   ` Jiri Pirko
  2018-09-27 16:26     ` David Ahern
  0 siblings, 1 reply; 8+ messages in thread
From: Jiri Pirko @ 2018-09-27 14:32 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding, Ariel Almog, Tal Alon

Wed, Sep 26, 2018 at 01:52:59PM CEST, eranbe@mellanox.com wrote:
>Add devlink-exception man page. Devlink-exception tool will control device
>exception attributes, conditions, actions and logging.
>
>Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
>
>-------------------------------------------------------
>Copy paste man output to here for easier review process of the RFC.
>
>DEVLINK-EXCEPTION(8)                                                                                            Linux                                                                                           DEVLINK-EXCEPTION(8)
>
>NAME
>       devlink-exception - devlink exception configuration
>
>SYNOPSIS
>       devlink [ OPTIONS ] exception  { COMMAND | help }
>
>       OPTIONS := { -V[ersion] | -n[no-nice-names] }
>
>       devlink exception show [ DEV ] [ condition NAME ] [ action NAME ]
>
>       devlink exception condition set DEV name NAME [ action NAME { active | inactive } ]
>
>       devlink exception action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
>
>       devlink exception help
>
>DESCRIPTION
>       devlink-exception tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the conditions that can trigger exception activity. Set for each condition the follow up opera‐
>       tions, such as, reset and dump of info. In addition, set the exception activity termination action.
>
>   devlink exception show - Display devlink exception conditions and actions attributes
>       DEV    Specifies the devlink device to show.
>
>       condition NAME
>              Specifies the devlink condition to show.
>
>       action NAME
>              Specifies the devlink action to show.
>
>   devlink exception condition set - sets devlink exception condition attributes
>       DEV    Specifies the devlink device to set.
>
>       name NAME
>              Name of the condition to set.
>
>       action NAME { active | inactive }
>                  Specify which actions to activate and which to deactivate once a condition was triggered. Actions can be dump, reset, etc.
>
>   devlink exception action set - sets devlink action attributes.
>       Once this command is launched, period and count measurement will be reset.
>
>       DEV    Specifies the devlink device to set.
>
>       name NAME
>              Specifies the devlink action to set.
>
>       period PERIOD
>              The period on which we limit the amount of performed actions, measured in seconds.
>
>       count COUNT
>              The maximum number of actions performed in a limited time frame.
>
>       fail   { ignore | down }
>                  Specify the behavior once count limit was reached.
>
>                  ignore - Skip triggering this action.
>
>                  down - Driver will remain in nonoperational state.
>
>EXAMPLES
>       devlink exception show
>           Shows the exception state of all devlink devices on the system.
>
>       devlink exception show pci/0000:01:00.0
>           Shows the exception state of specified devlink device.
>
>       devlink exception condition set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>           Sets TX_COMP_ERROR condition parameters for a specific device.
>
>       devlink exception action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
>           Sets exception attributes for reset action. Period timer and counter are being reset.

Looks good to me. But still, I need the code so I can play with it, to
see the outputs etc.

Thanks!



* Re: [RFC PATCH iproute2-next V2] System specification exception API
  2018-09-27 14:02   ` Eran Ben Elisha
@ 2018-09-27 14:34     ` Jiri Pirko
  2018-09-27 15:04       ` Eran Ben Elisha
  0 siblings, 1 reply; 8+ messages in thread
From: Jiri Pirko @ 2018-09-27 14:34 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding, Ariel Almog, Tal Alon

Thu, Sep 27, 2018 at 04:02:48PM CEST, eranbe@mellanox.com wrote:
>
>
>On 9/27/2018 3:47 PM, Jiri Pirko wrote:
>> Wed, Sep 26, 2018 at 01:52:58PM CEST, eranbe@mellanox.com wrote:
>> > The exception spec is targeted for Real Time Alerting, in order to know when
>> > something bad had happened to a PCI device
>> > - Provide alert debug information
>> > - Self healing
>> > - If problem needs vendor support, provide a way to gather all needed debugging
>> >   information.
>> > 
>> > The exception mechanism contains condition checkers which sense for malfunction. Upon a condition hit,
>> > actions such as logs and correction can be taken.
>> > 
>> > The condition checkers are divided into the following groups
>> > - Hardware - a checker which is triggered by the device due to
>> >   malfunction.
>> > - Software - a checker which is triggered by the software due to
>> >   malfunction.
>> 
>> What do you mean by a "software malfunction", a "FW malfunction"?
>> Also, I don't see this 2 groups in the man.
>
>Software malfunction can be a Transmit error (caused by bad send request).

Sorry, but I still don't understand what "software malfunction" you are
talking about. Could you be more specific please?


>FW/HW malfunction can be any catastrophic error report (the ones that should
>be exposed to driver).
>The comment here was to highlight that we can support different kinds of
>condition groups.
>If for a specific condition, we will need to highlight it is SW/HW, we can
>concatenate it to its name.
>
>Eran
>

* Re: [RFC PATCH iproute2-next V2] System specification exception API
  2018-09-27 14:34     ` Jiri Pirko
@ 2018-09-27 15:04       ` Eran Ben Elisha
  0 siblings, 0 replies; 8+ messages in thread
From: Eran Ben Elisha @ 2018-09-27 15:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding, Ariel Almog, Tal Alon



On 9/27/2018 5:34 PM, Jiri Pirko wrote:
> Thu, Sep 27, 2018 at 04:02:48PM CEST, eranbe@mellanox.com wrote:
>>
>>
>> On 9/27/2018 3:47 PM, Jiri Pirko wrote:
>>> Wed, Sep 26, 2018 at 01:52:58PM CEST, eranbe@mellanox.com wrote:
>>>> The exception spec is targeted for Real Time Alerting, in order to know when
>>>> something bad had happened to a PCI device
>>>> - Provide alert debug information
>>>> - Self healing
>>>> - If problem needs vendor support, provide a way to gather all needed debugging
>>>>    information.
>>>>
>>>> The exception mechanism contains condition checkers which sense for malfunction. Upon a condition hit,
>>>> actions such as logs and correction can be taken.
>>>>
>>>> The condition checkers are divided into the following groups
>>>> - Hardware - a checker which is triggered by the device due to
>>>>    malfunction.
>>>> - Software - a checker which is triggered by the software due to
>>>>    malfunction.
>>>
>>> What do you mean by a "software malfunction", a "FW malfunction"?
>>> Also, I don't see this 2 groups in the man.
>>
>> Software malfunction can be a Transmit error (caused by bad send request).
> 
> Sorry, but I still don't understand what "software malfunction" you are
> talking about. Could you be more specific please?

* The driver builds a bad send work request (bug in the driver, bug in the
packet generator, etc.). When it posts it, it gets back an error
completion from the HW. This error might put the HW queue into an
error state, so that it cannot be used again until it is "recovered".

Condition: Error completion
Action: Queue recover
The entire scenario is due to a SW malfunction.

* The driver tries to configure a HW QoS register, but the command is
failed by the FW.

Condition: Command execution error
Action: Dump of command + dump of SW internal related DB + dump of FW
related DB

* Another existing example is the ndo_tx_timeout routine. (This is done
in the networking stack layer, and can be configured today via sysfs.)
If a vendor driver has another specific checking routine like this one
(which needs to be configured from userspace), then it can be handled
via devlink-exception and be tagged as a software condition.
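
To make the last case concrete, here is a minimal sketch of a periodic
software checker in the spirit of a TX timeout watchdog. It is illustrative
only and is not how ndo_tx_timeout is implemented: the TxQueue fields, the
TX_STALLED condition name and the 5 second default timeout are all assumptions
made up for this example.

```python
from dataclasses import dataclass


@dataclass
class TxQueue:
    pending: int             # work requests posted but not yet completed
    last_completion: float   # time of the most recent completion, in seconds


def check_tx_stalled(queue, now, timeout):
    """True if the queue has pending work but no completion for `timeout` s."""
    return queue.pending > 0 and (now - queue.last_completion) >= timeout


def run_periodic_check(queues, now, timeout=5.0):
    """Return the names of queues that hit the hypothetical TX_STALLED condition."""
    return [name for name, q in queues.items()
            if check_tx_stalled(q, now, timeout)]


queues = {
    "txq0": TxQueue(pending=3, last_completion=10.0),  # busy, no progress for 6 s
    "txq1": TxQueue(pending=0, last_completion=1.0),   # idle, nothing to complete
    "txq2": TxQueue(pending=2, last_completion=14.0),  # busy, progressed 2 s ago
}
stalled = run_periodic_check(queues, now=16.0)
```

Each name returned here would be reported as a condition hit, and the actions
mapped to that condition (for example dump, then queue recover) would follow.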


* Re: [RFC PATCH iproute2-next V2] man: Add devlink exception man page
  2018-09-27 14:32   ` Jiri Pirko
@ 2018-09-27 16:26     ` David Ahern
  0 siblings, 0 replies; 8+ messages in thread
From: David Ahern @ 2018-09-27 16:26 UTC (permalink / raw)
  To: Jiri Pirko, Eran Ben Elisha
  Cc: netdev, Jakub Kicinski, Jiri Pirko, Stephen Hemminger,
	Andrew Lunn, Tobin C. Harding, Ariel Almog, Tal Alon

On 9/27/18 8:32 AM, Jiri Pirko wrote:
> But still, I need the code so I can play with it, to
> see the outputs etc.

+1
