From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 12 Feb 2019 16:30:17 +0000
From: Russell King - ARM Linux admin
To: Heiner Kallweit
Cc: Andrew Lunn, John David Anglin, Vivien Didelot, Florian Fainelli,
	netdev@vger.kernel.org
Subject: Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are
	handled prior to exit
Message-ID: <20190212163017.lwstmgtyw76cwrd7@shell.armlinux.org.uk>
References: <20190130223846.GB30115@lunn.ch>
	<9415d82e-965b-7777-0ad0-f23d6c9f177e@bell.net>
	<53b49df8-53ed-704f-9197-230b18d83090@bell.net>
	<824d011b-3692-69c3-5e2c-58e950a80abf@bell.net>
	<6a1ebc61-3505-beb8-21cb-ea42ad9fe67e@bell.net>
	<20190211233327.GB8591@lunn.ch>
	<2b6bbb4c-1346-461b-ff7a-cb96b4142f7a@bell.net>
	<20190212035806.GE19023@lunn.ch>
	<13c1e6d5-c287-0091-3b24-1978f9a18e7e@gmail.com>
In-Reply-To: <13c1e6d5-c287-0091-3b24-1978f9a18e7e@gmail.com>
User-Agent: NeoMutt/20170113 (1.7.2)
X-Mailing-List: netdev@vger.kernel.org

On Tue, Feb 12, 2019 at 07:51:05AM +0100, Heiner Kallweit wrote:
> On 12.02.2019 04:58, Andrew Lunn wrote:
> > That change means we don't check the PHY device if it caused an
> > interrupt when its state is less than UP.
> >
> > What i'm seeing is that the PHY is interrupting pretty early on after
> > a reboot when the previous boot had the interface up.
> >
> So this means that when going down for reboot the interrupts are not
> properly masked / disabled? Because (at least for net-next) we enable
> interrupts in phy_start() only.

Looking at Linus' tree as opposed to net-next, things do look rather
broken wrt interrupts:

+-phy_attach_direct
  `-phydev->state = PHY_READY
+-phy_prepare_link
+-phy_start_machine
  `-phy_trigger_machine()
`-phy_start_interrupts
  +-request_threaded_irq()
  `-phy_enable_interrupts()
    +-phy_clear_interrupt()
    `-phy_config_interrupt(, PHY_INTERRUPT_ENABLED)

At this point, the PHY is then able to generate interrupts, which,
because phy_start() has not been called and phy_interrupt() checks
that phydev->state >= PHY_UP, get ignored by the interrupt handler
exactly as Andrew is finding.

So it looks like 5.0-rc is already in need of this being fixed.

In looking at this, I came across this chunk of code:

static inline bool __phy_is_started(struct phy_device *phydev)
{
	WARN_ON(!mutex_is_locked(&phydev->lock));

	return phydev->state >= PHY_UP;
}

/**
 * phy_is_started - Convenience function to check whether PHY is started
 * @phydev: The phy_device struct
 */
static inline bool phy_is_started(struct phy_device *phydev)
{
	bool started;

	mutex_lock(&phydev->lock);
	started = __phy_is_started(phydev);
	mutex_unlock(&phydev->lock);

	return started;
}

which looks to me like over-complication. The mutex locking there is
completely pointless - what are you trying to achieve with it?

Let's go through this. The above is exactly equivalent to:

bool phy_is_started(struct phy_device *phydev)
{
	int state;

	mutex_lock(&phydev->lock);
	state = phydev->state;
	mutex_unlock(&phydev->lock);

	return state >= PHY_UP;
}

since when we do the test is irrelevant. Architectures that Linux runs
on are single-copy atomic, which means that reading phydev->state itself
is an atomic operation. So, the mutex locking around that doesn't add
to the atomicity of the entire operation.

However, whether the entire operation is safe or not depends on what
the caller does with the result after the lock has been dropped.
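
To make the check-then-act point concrete, here is a minimal user-space
sketch. It is not phylib code and not a proposed patch: struct demo_phy,
the demo_* functions and the DEMO_* states are hypothetical stand-ins,
a pthread mutex stands in for phydev->lock, and a plain flag stands in
for the delayed work. The "other thread" is modelled as a callback run
in the window between the check and the act, so the interleaving is
deterministic.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PHY states; only the ordering matters. */
enum demo_state { DEMO_READY, DEMO_HALTED, DEMO_UP };

struct demo_phy {
	pthread_mutex_t lock;
	enum demo_state state;
	bool work_queued;	/* stands in for the delayed work */
};

/* Locked getter: the result may be stale once the lock is dropped. */
static bool demo_is_started(struct demo_phy *phy)
{
	bool started;

	pthread_mutex_lock(&phy->lock);
	started = phy->state >= DEMO_UP;
	pthread_mutex_unlock(&phy->lock);

	return started;
}

/* Stop path: halt the PHY and cancel any pending work. */
static void demo_stop(struct demo_phy *phy)
{
	pthread_mutex_lock(&phy->lock);
	phy->state = DEMO_HALTED;
	phy->work_queued = false;
	pthread_mutex_unlock(&phy->lock);
}

/*
 * Racy: the check and the act are separate critical sections.
 * 'interleave' models another thread running in the window between
 * them; pass NULL for no interference.
 */
static void demo_requeue_racy(struct demo_phy *phy,
			      void (*interleave)(struct demo_phy *))
{
	bool started = demo_is_started(phy);

	if (interleave)
		interleave(phy);

	if (started) {
		pthread_mutex_lock(&phy->lock);
		phy->work_queued = true;
		pthread_mutex_unlock(&phy->lock);
	}
}

/* Check and act inside one critical section: no window. */
static void demo_requeue_locked(struct demo_phy *phy)
{
	pthread_mutex_lock(&phy->lock);
	if (phy->state >= DEMO_UP)
		phy->work_queued = true;
	pthread_mutex_unlock(&phy->lock);
}
```

Calling demo_requeue_racy(&phy, demo_stop) on a PHY in DEMO_UP leaves
it halted with the work queued again, whereas demo_requeue_locked()
called after demo_stop() leaves the work cancelled. The sketch
simplifies one thing on purpose: it cancels the work under the lock,
which is only reasonable here because the "work" is a flag.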

For example, let's take this code at the end of phy_state_machine():

	if (phy_polling_mode(phydev) && phy_is_started(phydev))
		phy_queue_state_machine(phydev, PHY_STATE_TIME);

state = PHY_UP
	thread 0			thread 1
					phy_disconnect()
					+-phy_is_started()
	phy_is_started()		|
					`-phy_stop()
					  +-phydev->state = PHY_HALTED
					  `-phy_stop_machine()
					    `-cancel_delayed_work_sync()
	phy_queue_state_machine()
	`-mod_delayed_work()

At this point, the phydev->state_queue work has been added back onto
the system workqueue despite phy_stop_machine() having been called and
cancel_delayed_work_sync() called on it.

The original code in 4.20 did not have this race condition.

Basically, the lock inside phy_is_started() does nothing useful, and I'd
say is dangerously misleading.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up