Date: Wed, 18 Dec 2019 22:09:08 +0000
From: Russell King - ARM Linux admin
To: Heiner Kallweit
Cc: Andrew Lunn, Florian Fainelli, "David S. Miller", netdev@vger.kernel.org
Subject: Re: [PATCH net] net: phy: make phy_error() report which PHY has failed
Message-ID: <20191218220908.GX25745@shell.armlinux.org.uk>
References: <20191217233436.GS25745@shell.armlinux.org.uk> <61f23d43-1c4d-a11e-a798-c938a896ddb3@gmail.com>
In-Reply-To: <61f23d43-1c4d-a11e-a798-c938a896ddb3@gmail.com>

On Wed, Dec 18, 2019 at 09:54:32PM +0100, Heiner Kallweit wrote:
> On 18.12.2019 00:34, Russell King - ARM Linux admin wrote:
> > On Tue, Dec 17, 2019 at 10:41:34PM +0100, Heiner Kallweit wrote:
> >> On 17.12.2019 13:53, Russell King wrote:
> >>> phy_error() is called from phy_interrupt() or phy_state_machine(), and
> >>> uses WARN_ON() to print a backtrace. The backtrace is not useful when
> >>> reporting a PHY error.
> >>>
> >>> However, a system may contain multiple ethernet PHYs, and phy_error()
> >>> gives no clue which one caused the problem.
> >>>
> >>> Replace WARN_ON() with a call to phydev_err() so that we can see which
> >>> PHY had an error, and also inform the user that we are halting the PHY.
> >>>
> >>> Fixes: fa7b28c11bbf ("net: phy: print stack trace in phy_error")
> >>> Signed-off-by: Russell King
> >>> ---
> >>> There is another related problem in this area. If an error is detected
> >>> while the PHY is running, phy_error() moves to PHY_HALTED state. If we
> >>> try to take the network device down, then:
> >>>
> >>> void phy_stop(struct phy_device *phydev)
> >>> {
> >>> 	if (!phy_is_started(phydev)) {
> >>> 		WARN(1, "called from state %s\n",
> >>> 		     phy_state_to_str(phydev->state));
> >>> 		return;
> >>> 	}
> >>>
> >>> triggers, and we never do any of the phy_stop() cleanup. I'm not sure
> >>> what the best way to solve this is - introducing a PHY_ERROR state may
> >>> be a solution, but I think we want some phy_is_started() sites to
> >>> return true for it and others to return false.
> >>>
> >>> Heiner - you introduced the above warning, could you look at improving
> >>> this case so we don't print a warning and taint the kernel when taking
> >>> a network device down after phy_error() please?
> >>>
> >> I think we need both types of information:
> >> - the affected PHY device
> >> - the stack trace to see where the issue was triggered
> >
> > Can you please explain why the stack trace is useful. For the paths
> > that are reachable, all it tells you is whether it was reached via
> > the interrupt or the workqueue.
> >
> > If it's via the interrupt, the rest of the backtrace beyond that is
> > irrelevant. If it's the workqueue, the backtrace doesn't go back
> > very far, and doesn't tell you what operation triggered it.
> >
> > If it's important to see where or why phy_error() was called, there
> > are much better ways of doing that, notably passing a string into
> > phy_error() to describe the actual error itself. That would convey
> > way more useful information than the backtrace does.
> >
> > I have been faced with these backtraces, and they have not been at
> > all useful for diagnosing the problem.
> >
> "The problem" comes in two flavors:
> 1. The problem that caused the PHY error
> 2. The problem caused by the PHY error (if we decide to not
>    always switch to HALTED state)
>
> We can't do much for case 1, maybe we could add an errno argument
> to phy_error(). To facilitate analyzing case 2 we'd need to change
> code pieces like the following.
>
> case a:
> 	err = f1();
> case b:
> 	err = f2();
>
> if (err)
> 	phy_error()
>
> For my understanding: What caused the PHY error in your case(s)?
> Which info would have been useful for analyzing the error?

Errors reading/writing from the PHY.

The problem with a backtrace from phy_error() is it doesn't tell you
where the error actually occurred, it only tells you where the error
is reported - which is one of two different paths at the moment.

That can be achieved with much more elegance and simplicity by passing
a string into phy_error() to describe the call site if that's even
relevant.

I would say, however, that knowing where the error occurred would be
far better information.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
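
For illustration only, here is a rough userspace sketch of the idea
discussed above: an error-reporting helper that takes a string describing
the call site plus an errno value, instead of relying on a backtrace.
Everything in it is an assumption made for the example, not kernel code:
phy_error_at(), mock_phy_read(), the cut-down struct phy_device and the
device name are all hypothetical stand-ins, and fprintf() stands in for
phydev_err().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel types; hypothetical, illustration only. */
enum phy_state { PHY_RUNNING, PHY_HALTED };

struct phy_device {
	const char *name;	/* stand-in for the device name phydev_err() prints */
	enum phy_state state;
};

/*
 * Hypothetical variant of phy_error(): the caller passes a short
 * description of the call site and the negative errno it saw, so the
 * log says which PHY failed, where, and why.
 */
void phy_error_at(struct phy_device *phydev, const char *where, int err)
{
	fprintf(stderr, "%s: error %d (%s) during %s, halting PHY\n",
		phydev->name, err, strerror(-err), where);
	phydev->state = PHY_HALTED;
}

/* Mock register read that always fails, standing in for a failed bus read. */
int mock_phy_read(struct phy_device *phydev, int regnum)
{
	(void)phydev;
	(void)regnum;
	return -EIO;
}
```

A caller would then do something like
"err = mock_phy_read(&phy, 1); if (err < 0) phy_error_at(&phy, "link status read", err);"
which reports the failing operation directly rather than only the path
through which the error was reported.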