From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F51ED51F for ; Tue, 5 Sep 2023 14:01:00 +0000 (UTC) Received: from pandora.armlinux.org.uk (unknown [IPv6:2001:4d48:ad52:32c8:5054:ff:fe00:142]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EB206197 for ; Tue, 5 Sep 2023 07:00:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=armlinux.org.uk; s=pandora-2019; h=Sender:In-Reply-To:Content-Type: MIME-Version:References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=MkGH9dqOj1JGHR8MKUGPeijOi4+2Yf0aUnPnT6G1g/4=; b=SbtMkLevzN/6/riyAImnTLElA8 Wc5WRpeOdB0f1ZKmtVDR7F0PA5F3DRo5eIr1vOvfkRwVavNpLPWlPHIjzMebLm9gi7XaAoKlI7zFg HW1mvLWLWj9danrFJ9lFWeWuN/B+LKs7E23cgxOugnf/x95Sr1qZsGxjK6I7pOh5OycCqG2vmwDKs mhaV1x/50W6smXp91lv3emHHL24weJKpLJSWzFpoO0tSv7pRmHCxqHqdPsASgvGmnQZ7UTBxkpGWe mXUyUfTJwwRvV1N/PeUoPkg5l7draiY6N8i94e68CnwQEUNIjST0Xid584sVg+PjKDKGyYZpTEapx xyNPQCyg==; Received: from shell.armlinux.org.uk ([fd8f:7570:feb6:1:5054:ff:fe00:4ec]:54536) by pandora.armlinux.org.uk with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1qdWbo-0007w6-21; Tue, 05 Sep 2023 15:00:48 +0100 Received: from linux by shell.armlinux.org.uk with local (Exim 4.94.2) (envelope-from ) id 1qdWbo-0003jZ-LM; Tue, 05 Sep 2023 15:00:48 +0100 Date: Tue, 5 Sep 2023 15:00:48 +0100 From: "Russell King (Oracle)" To: Andrew Lunn Cc: Jijie Shao , f.fainelli@gmail.com, davem@davemloft.net, edumazet@google.com, hkallweit1@gmail.com, kuba@kernel.org, netdev@vger.kernel.org, pabeni@redhat.com, "shenjian15@huawei.com" , "liuyonglong@huawei.com" , wangjie125@huawei.com, chenhao418@huawei.com, Hao Lan , "wangpeiyang1@huawei.com" Subject: Re: [PATCH net-next] net: phy: avoid kernel warning dump when stopping an errored PHY Message-ID: References: <8e7e02d8-2b2a-8619-e607-fbac50706252@huawei.com> <29917acb-bd80-10e5-b1ae-c844ea0e9cbb@huawei.com> <99eade9d-a580-4519-8399-832e196d335a@lunn.ch> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <99eade9d-a580-4519-8399-832e196d335a@lunn.ch> Sender: Russell King (Oracle) X-Spam-Status: No, score=-1.3 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_BLOCKED,RDNS_NONE, SPF_HELO_NONE,SPF_NONE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net On Tue, Sep 05, 2023 at 02:09:29PM +0200, Andrew Lunn wrote: > > When we do a phy_stop(), hardware might be error and we can't access to > > mdio.And our process is read/write mdio failed first, then do phy_stop(), > > reset hardware and call phy_start() finally. > > If the hardware/fimrware is already dead, you have to expect a stack > trace, because the once a second poll can happen, before you notice > the hardware/firmware is dead and call phy_stop(). > > You might want to also disconnect the PHY and reconnect it after the > reset. Andrew, I think that's what is being tried here, but there's a race between phy_stop() and phy_state_machine() which is screwing up phydev->state. Honestly, the locking in phy_state_machine() is insane, allows this race to happen, and really needs fixing... and I think that the phydev->lock usage has become really insane over the years. We have some driver methods now which are always called with the lock held, others where the lock may or may not be held, and others where the lock isn't held - and none of this is documented. Please can you have a look at the four patches I've just posted as attached to my previous email. I think we need to start sorting out some of this crazyness and santising the locking. My four patches address most of it, except the call to phy_suspend(). If that can be solved, then we can improve the locking more, and eliminate the race entirely. If we held the lock over the entire state machine function, then the problem that has been reported here would not exist - phy_stop() would not be able to "nip in" during the middle of the PHY state machine running, and thus we would not see the PHY_HALTED state overwritten by PHY_ERROR unexpectedly. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!