From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jakub Kicinski
Subject: Re: [RFC PATCH iproute2-next] System specification health API
Date: Thu, 13 Sep 2018 10:36:04 -0700
Message-ID: <20180913103604.0ef868f4@cakuba.netronome.com>
References: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org, Jiri Pirko, Andy Gospodarek, Michael Chan,
 Simon Horman, Alexander Duyck, Andrew Lunn, Florian Fainelli, Tal Alon,
 Ariel Almog
To: Eran Ben Elisha
Return-path:
Received: from mail-qk1-f171.google.com ([209.85.222.171]:45716 "EHLO
 mail-qk1-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1726824AbeIMWqt (ORCPT ); Thu, 13 Sep 2018 18:46:49 -0400
Received: by mail-qk1-f171.google.com with SMTP id z125-v6so3636199qkb.12
 for ; Thu, 13 Sep 2018 10:36:20 -0700 (PDT)
In-Reply-To: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted for Real Time Alerting, in order to know when
> something bad had happened to a PCI device

By spec you mean some standards body spec you implement or this
proposal is a spec?

> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed debugging
>   information.
>
> The health contains sensors which sense for malfunction. Once sensor triggered,
> actions such as logs and correction can be taken.
> Sensors are sensing the health state and can trigger correction action.
>
> The sensors are divided into the following groups
> - Hardware sensor - a sensor which is triggered by the device due to
>   malfunction.
> - Software sensor - a sensor which is triggered by the software due to
>   malfunction.
> Both group of sensors can be triggered due to error event or due to a periodic check.
>
> Actions are the way to handle sensor events. Action can be in one of the
> following groups:
> - Dump - SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> Actions can be performed by SW or HW.
>
> User is allowed to enable or disable sensors and sensor2action mapping.
>
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better? Are there going to be values reported? I'm not
so sure about HW sensors in relation to existing HWMON infrastructure...
I assume you're targeting things like say some HW engine/block
reporting it encountered an error? Sounds good, too.

Are the actions all envisioned to be performed by the driver?
Firmware? Hardware? I guess that distinction can be added later. For
FW/HW actions we would go back to the problem of persistence of the
setting since it was only implemented for params :S

Is the dump option going to tie back into region snapshots?
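
For readers following the thread, the sensor and sensor2action model in
the quoted proposal could be sketched roughly as below. This is only an
illustration in Python, not the proposed devlink-health API: the class
names, method names, and the `tx_timeout` sensor used in the usage note
are all hypothetical, and the sketch ignores whether an action runs in
the driver, firmware, or hardware.

```python
from dataclasses import dataclass, field
from enum import Enum


class SensorGroup(Enum):
    HW = "hw"  # triggered by the device
    SW = "sw"  # triggered by software


class Action(Enum):
    DUMP = "dump"    # SW/HW trace or dump
    RESET = "reset"  # surgical correction, e.g. flush a queue


@dataclass
class Sensor:
    name: str
    group: SensorGroup
    enabled: bool = True
    actions: set = field(default_factory=set)  # sensor2action mapping


class Health:
    """Hypothetical container for sensors and their action mappings."""

    def __init__(self):
        self.sensors = {}

    def add_sensor(self, name, group):
        self.sensors[name] = Sensor(name, group)

    def set_enabled(self, name, enabled):
        # User-controlled enable/disable of a single sensor.
        self.sensors[name].enabled = enabled

    def map_action(self, name, action, on=True):
        # User-controlled sensor2action mapping.
        actions = self.sensors[name].actions
        if on:
            actions.add(action)
        else:
            actions.discard(action)

    def trigger(self, name):
        """Return the mapped actions sorted by name, or [] if disabled."""
        sensor = self.sensors[name]
        if not sensor.enabled:
            return []
        return sorted(sensor.actions, key=lambda a: a.value)
```

With a software sensor named `tx_timeout` mapped to both actions,
`trigger("tx_timeout")` returns the dump and reset actions; after
`set_enabled("tx_timeout", False)` it returns an empty list.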