From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Lunn
Subject: Re: [RFC PATCH iproute2-next] man: Add devlink health man page
Date: Thu, 13 Sep 2018 17:12:52 +0200
Message-ID: <20180913151252.GC23892@lunn.ch>
References: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>
 <1536826696-9413-2-git-send-email-eranbe@mellanox.com>
 <20180913120815.GB11702@lunn.ch>
 <5c12253d-2100-09bb-9e3e-6259fc7a9323@mellanox.com>
 <20180913132453.GE11702@lunn.ch>
 <66584ca2-8efa-9a6d-c1f3-1cf81cb04259@mellanox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: netdev@vger.kernel.org, Jiri Pirko , Andy Gospodarek , Michael Chan ,
 Jakub Kicinski , Simon Horman , Alexander Duyck , Florian Fainelli ,
 Tal Alon , Ariel Almog
To: Eran Ben Elisha
Return-path:
Received: from vps0.lunn.ch ([185.16.172.187]:33593 "EHLO vps0.lunn.ch"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1726824AbeIMUWu (ORCPT ); Thu, 13 Sep 2018 16:22:50 -0400
Content-Disposition: inline
In-Reply-To: <66584ca2-8efa-9a6d-c1f3-1cf81cb04259@mellanox.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

> >>>> devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>>> Sets TX_COMP_ERROR sensor parameters for a specific device.

> >>This is what I had in mind:
> >>1. command interface error
> >>2. command interface timeout
> >>3. stuck TX queue (like tx_timeout)
> >>4. stuck TX completion queue (driver did not process packets in a reasonable
> >>time period)
> >>5. stuck RX queue
> >>6. RX completion error
> >>7. TX completion error
> >>8. HW / FW catastrophic error report
> >>9. completion queue overrun

> Such issues do exist in production environment, and need to be handled even
> if root cause is a bug which will be fixed in latest release. My feature
> should help developers / administrator to control and recover their live
> systems, by auto correction and logging support.
> Goal is:
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed
> debugging information.

So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.

devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on

seems a lot more understandable than:

devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on

Andrew