From mboxrd@z Thu Jan 1 00:00:00 1970
From: Geert Uytterhoeven
Subject: Re: [RFC net-next 4/4] net: phy: Correctly process PHY_HALTED in phy_stop_machine()
Date: Mon, 27 Nov 2017 08:48:11 +0100
Message-ID:
References: <20171025232124.14120-1-f.fainelli@gmail.com> <20171025232124.14120-5-f.fainelli@gmail.com> <2630f8f3-9c07-cfe4-6d03-d8e3a746556f@gmail.com> <4a784794-ffd7-4556-d8f9-05cf39d3eb91@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: "netdev@vger.kernel.org", Marc Gonzalez, "David S. Miller", Andrew Lunn, opendmb@gmail.com, Mason, David Daney, Geert Uytterhoeven
To: Florian Fainelli
Return-path:
Received: from mail-qt0-f195.google.com ([209.85.216.195]:41299 "EHLO mail-qt0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751034AbdK0HsM (ORCPT ); Mon, 27 Nov 2017 02:48:12 -0500
Received: by mail-qt0-f195.google.com with SMTP id i40so26692359qti.8 for ; Sun, 26 Nov 2017 23:48:12 -0800 (PST)
In-Reply-To: <4a784794-ffd7-4556-d8f9-05cf39d3eb91@gmail.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

Hi Florian,

On Mon, Nov 27, 2017 at 5:05 AM, Florian Fainelli wrote:
> On 11/06/2017 07:50 AM, Geert Uytterhoeven wrote:
>> On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli wrote:
>>> On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote:
>>>> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli wrote:
>>>>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
>>>>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli wrote:
>>>>>>> Marc reported that he was not getting the PHY library adjust_link()
>>>>>>> callback function to run when calling phy_stop() + phy_disconnect()
>>>>>>> which does not indeed happen because we set the state machine to
>>>>>>> PHY_HALTED but we don't get to run it to process this state past that
>>>>>>> point.
>>>>>>>
>>>>>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>>>>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>>>>>> down, turn the network device's carrier off and finally call the
>>>>>>> adjust_link() function.
>>>>>>>
>>>>>>> At the end of phy_state_machine() though, if we are going to be moving
>>>>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>>>>>>> is pointless.
>>>>>>>
>>>>>>> Reported-by: Marc Gonzalez
>>>>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>>>>>> Signed-off-by: Marc Gonzalez
>>>>>>> Signed-off-by: Florian Fainelli
>>>>>>
>>>>>> Thanks for your patch!
>>>>>>
>>>>>> Unfortunately, after applying this one, the last in your series, both
>>>>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
>>>>>> suspend/resume path, due to register accesses while the device is already
>>>>>> suspended:
>>>>>
>>>>> OK, seems like there is another path, uncovered by this patch that we
>>>>> can be hitting, does the following patch below help?
>>>>
>>>> Unfortunately it doesn't help.
>>>
>>> OK :/
>>>
>>>>
>>>>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
>>>>
>>>> Note that this is an imprecise external abort, i.e. its reporting may
>>>> be delayed,
>>>> and the backtrace may be inaccurate.
>>>
>>> True, can you help narrow it down with me? Can you confirm that
>>> adjust_link() (assuming that is the problem) does not get called past
>>> phy_stop_machine() as it should?
>>
>> I've added some additional debug checks (keep track of both phy and
>> smsc state, and refuse to access registers if smsc is disabled).
>
> Thanks for doing that, and sorry for responding that late.
>
>>
>> Apparently phy_stop_machine() is called twice:
>> - Once from mdio_bus_phy_suspend(), cfr. the first backtrace,
>> - A second time from smsc911x_suspend(), cfr. the second backtrace.
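
The double call and the debug guard described above can be modelled in plain
userspace C. This is only a sketch: every model_* name is hypothetical and none
of it is kernel code. It also assumes, for illustration, that a stop call which
finds the PHY already halted still runs the link-down path and that the smsc
gets marked disabled between the two calls; the thread does not establish the
exact mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace model only; all names are hypothetical stand-ins. */
struct model_dev {
	bool smsc_enabled;	/* tracked MAC state */
	bool phy_halted;	/* tracked PHY state */
	int refused;		/* register accesses refused by the guard */
};

/* Guarded register access: refuse instead of touching disabled hardware. */
static bool model_reg_access(struct model_dev *d)
{
	assert(d != NULL);
	if (!d->smsc_enabled) {
		d->refused++;
		fprintf(stderr, "model: smsc disabled, register access refused\n");
		return false;
	}
	return true;
}

/* Stand-in for the MAC driver's adjust_link() callback. */
static void model_adjust_link(struct model_dev *d)
{
	(void)model_reg_access(d);
}

/*
 * Stand-in for stopping the state machine. Assumption: a call that finds
 * the PHY already halted still ends up in the adjust_link() callback.
 */
static void model_stop_machine(struct model_dev *d)
{
	if (d->phy_halted)
		model_adjust_link(d);
	d->phy_halted = true;
}

/* Replays the two stop calls and returns the number of refused accesses. */
static int model_double_stop(void)
{
	struct model_dev d = { .smsc_enabled = true };

	model_stop_machine(&d);		/* first call (MDIO bus suspend path) */
	d.smsc_enabled = false;		/* assumption: MAC disabled in between */
	model_stop_machine(&d);		/* second call (MAC driver suspend path) */
	return d.refused;
}
```

With the guard in place the second call is merely counted and logged; without
it, the equivalent access on real hardware is what would raise the abort.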
>>
>> The second call causes a call to smsc911x_phy_adjust_link() while the smsc is
>> already disabled, cfr. the third backtrace. This would trigger the imprecise
>> external abort if I let it access the registers.
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597
>> phy_stop_machine+0x44/0xcc
>> phy_stop_machine: phy running, good
>> CPU: 0 PID: 1083 Comm: bash Not tainted
>> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
>> Hardware name: Generic R8A73A4 (Flattened Device Tree)
>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>> [] (show_stack) from [] (dump_stack+0xa4/0xdc)
>> [] (dump_stack) from [] (__warn+0xcc/0xfc)
>> [] (__warn) from [] (warn_slowpath_fmt+0x34/0x44)
>> [] (warn_slowpath_fmt) from [] (phy_stop_machine+0x44/0xcc)
>> [] (phy_stop_machine) from []
>> (mdio_bus_phy_suspend+0x24/0x40)
>> [] (mdio_bus_phy_suspend) from []
>> (dpm_run_callback+0x17c/0x3ec)
>> [] (dpm_run_callback) from [] (__device_suspend+0x498/0x6b0)
>> [] (__device_suspend) from [] (dpm_suspend+0x1d8/0x568)
>> [] (dpm_suspend) from []
>> (suspend_devices_and_enter+0x78/0xe98)
>> [] (suspend_devices_and_enter) from []
>> (pm_suspend+0xa40/0xbec)
>> [] (pm_suspend) from [] (state_store+0xac/0xcc)
>> [] (state_store) from [] (kernfs_fop_write+0x190/0x1d0)
>> [] (kernfs_fop_write) from [] (__vfs_write+0x20/0x11c)
>> [] (__vfs_write) from [] (vfs_write+0xb8/0x144)
>> [] (vfs_write) from [] (SyS_write+0x40/0x80)
>> [] (SyS_write) from [] (ret_fast_syscall+0x0/0x28)
>> ---[ end trace 8fc4c71351438007 ]---
>> libphy: phy_stop_machine: Kicking state machine synchronously
>> libphy: phy_stop_machine: Kicking state machine done
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598
>> phy_stop_machine+0x64/0xcc
>> phy_stop_machine: phy already stopped
>> CPU: 0 PID: 1083 Comm: bash Tainted: G W
>> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
>> Hardware name: Generic R8A73A4 (Flattened Device Tree)
>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>> [] (show_stack) from [] (dump_stack+0xa4/0xdc)
>> [] (dump_stack) from [] (__warn+0xcc/0xfc)
>> [] (__warn) from [] (warn_slowpath_fmt+0x34/0x44)
>> [] (warn_slowpath_fmt) from [] (phy_stop_machine+0x64/0xcc)
>> [] (phy_stop_machine) from [] (smsc911x_suspend+0x44/0xa4)
>> [] (smsc911x_suspend) from [] (dpm_run_callback+0x17c/0x3ec)
>> [] (dpm_run_callback) from [] (__device_suspend+0x498/0x6b0)
>> [] (__device_suspend) from [] (dpm_suspend+0x1d8/0x568)
>> [] (dpm_suspend) from []
>> (suspend_devices_and_enter+0x78/0xe98)
>> [] (suspend_devices_and_enter) from []
>> (pm_suspend+0xa40/0xbec)
>> [] (pm_suspend) from [] (state_store+0xac/0xcc)
>> [] (state_store) from [] (kernfs_fop_write+0x190/0x1d0)
>> [] (kernfs_fop_write) from [] (__vfs_write+0x20/0x11c)
>> [] (__vfs_write) from [] (vfs_write+0xb8/0x144)
>> [] (vfs_write) from [] (SyS_write+0x40/0x80)
>> [] (SyS_write) from [] (ret_fast_syscall+0x0/0x28)
>> ---[ end trace 8fc4c71351438008 ]---
>> libphy: phy_stop_machine: Kicking state machine synchronously
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988
>> smsc911x_phy_adjust_link+0x2c/0x2e0
>> PHY stopped
>> CPU: 0 PID: 1083 Comm: bash Tainted: G W
>> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
>> Hardware name: Generic R8A73A4 (Flattened Device Tree)
>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>> [] (show_stack) from [] (dump_stack+0xa4/0xdc)
>> [] (dump_stack) from [] (__warn+0xcc/0xfc)
>> [] (__warn) from [] (warn_slowpath_fmt+0x34/0x44)
>> [] (warn_slowpath_fmt) from []
>> (smsc911x_phy_adjust_link+0x2c/0x2e0)
>> [] (smsc911x_phy_adjust_link) from []
>> (phy_link_down+0x18/0x24)
>> [] (phy_link_down) from [] (phy_state_machine+0x2d0/0x3f4)
>> [] (phy_state_machine) from [] (phy_stop_machine+0x9c/0xcc)
>> [] (phy_stop_machine) from [] (smsc911x_suspend+0x44/0xa4)
>> [] (smsc911x_suspend) from [] (dpm_run_callback+0x17c/0x3ec)
>> [] (dpm_run_callback) from [] (__device_suspend+0x498/0x6b0)
>> [] (__device_suspend) from [] (dpm_suspend+0x1d8/0x568)
>> [] (dpm_suspend) from []
>> (suspend_devices_and_enter+0x78/0xe98)
>> [] (suspend_devices_and_enter) from []
>> (pm_suspend+0xa40/0xbec)
>> [] (pm_suspend) from [] (state_store+0xac/0xcc)
>> [] (state_store) from [] (kernfs_fop_write+0x190/0x1d0)
>> [] (kernfs_fop_write) from [] (__vfs_write+0x20/0x11c)
>> [] (__vfs_write) from [] (vfs_write+0xb8/0x144)
>> [] (vfs_write) from [] (SyS_write+0x40/0x80)
>> [] (SyS_write) from [] (ret_fast_syscall+0x0/0x28)
>> ---[ end trace 8fc4c71351438009 ]---
>> libphy: phy_stop_machine: Kicking state machine done
>>
>> If I revert your "net: smsc911x: Properly manage PHY during suspend/resume",
>> phy_stop_machine() is no longer called twice, and system suspend works.
>
> OK, there does appear to be a problem in how the network device vs. mdio
> bus are suspended/resumed in this particular driver, and I have no idea
> why, see below.
>
>>
>> However, during resume, smsc911x_mii_read() is called before the
>> smsc is enabled, cfr. the fourth backtrace:
>
> Humm, how is that possible? smsc911x_mii_bus properly sets its parent
> device to be the platform device of the network device (which is
> correct) so by virtue of child -> parent relationships, we should have
> the network device resume first, and then have the MDIO bus resume
> second (unless I am completely off here).

The MDIO bus callback is not called from the network device ...

>> WARNING: CPU: 1 PID: 17 at
>> drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88
>> Modules linked in:
>> CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G W
>> 4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013
>> Hardware name: Generic SH73A0 (Flattened Device Tree)
>>
>> smsc hangs off [fec10000.bus, which is started only here --->
>>
>> Workqueue: events_power_efficient phy_state_machine
>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>> [] (show_stack) from [] (dump_stack+0xa4/0xdc)
>> [] (dump_stack) from [] (__warn+0xcc/0xfc)
>> [] (__warn) from [] (warn_slowpath_null+0x1c/0x24)
>> [] (warn_slowpath_null) from []
>> (__smsc911x_reg_read+0x1c/0x88)
>> [] (__smsc911x_reg_read) from []
>> (smsc911x_mac_read+0x4c/0x118)
>> [] (smsc911x_mac_read) from []
>> (smsc911x_mii_read+0x2c/0xa4)
>> [] (smsc911x_mii_read) from [] (mdiobus_read+0x58/0x70)
>> [] (mdiobus_read) from [] (genphy_update_link+0x18/0x50)
>> [] (genphy_update_link) from []
>> (genphy_read_status+0xc/0x1cc)
>> [] (genphy_read_status) from []
>> (phy_state_machine+0xa8/0x3f4)
>> [] (phy_state_machine) from []
>> (process_one_work+0x240/0x3fc)
>> [] (process_one_work) from []
>> (worker_thread+0x2cc/0x40c)
>> [] (worker_thread) from [] (kthread+0x124/0x144)

... but from the worker thread, which is unaware of PM states, unless it
is a freezable workqueue.

Cfr. my patch "[1/2] net: phy: Freeze PHY polling before suspending devices"
(https://patchwork.kernel.org/patch/9915901/), which made it a freezable
workqueue, like was done on PCI to solve similar issues with PME scanning.

>> [] (kthread) from [] (ret_from_fork+0x14/0x2c)
>> ---[ end trace 21b7024e273f9f21 ]---
>> ------------[ cut here ]------------
>>
>> (yes, the fourth backtrace is from another machine, but I can trigger
>> all of this
>> on both r8a73a4/ape6evm and sh73a0/kzm9g anyway).

Gr{oetje,eeting}s,

Geert
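
For readers who want the freezable-workqueue point spelled out, here is a
minimal userspace sketch. It is not kernel code: the model_* names are
hypothetical, and the "workqueue" is just a pair of flags. It assumes a
polling work item that reads MAC registers whenever it runs, and shows that
only a queue which honours freezing skips the poll that would otherwise fire
while the MAC is still disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; all names are hypothetical stand-ins. */
struct model_wq {
	bool freezable;		/* does the queue take part in freezing? */
	bool frozen;		/* currently frozen for suspend */
};

struct model_mac {
	bool enabled;		/* MAC powered and clocked */
	int bad_reads;		/* register reads issued while disabled */
};

/* Suspend entry: only freezable queues stop running their work items. */
static void model_freeze(struct model_wq *wq)
{
	if (wq->freezable)
		wq->frozen = true;
}

/* Resume exit: work items may run again. */
static void model_thaw(struct model_wq *wq)
{
	wq->frozen = false;
}

/* One run of the polling work item, which reads MAC registers. */
static void model_poll(const struct model_wq *wq, struct model_mac *mac)
{
	assert(wq != NULL && mac != NULL);
	if (wq->frozen)
		return;			/* frozen queue: work does not run */
	if (!mac->enabled)
		mac->bad_reads++;	/* would fault on real hardware */
}

/* Replays one suspend/resume cycle; returns reads made while disabled. */
static int model_suspend_resume(bool freezable)
{
	struct model_wq wq = { .freezable = freezable };
	struct model_mac mac = { .enabled = true };

	model_poll(&wq, &mac);		/* normal operation */
	model_freeze(&wq);		/* suspend begins */
	mac.enabled = false;		/* MAC suspended */
	model_poll(&wq, &mac);		/* poll timer fires in the window */
	mac.enabled = true;		/* MAC resumed */
	model_thaw(&wq);
	model_poll(&wq, &mac);		/* polling continues safely */
	return mac.bad_reads;
}
```

Calling model_suspend_resume(false) counts one read against the disabled MAC,
while model_suspend_resume(true) counts none, which is the behaviour the
freezable workqueue is meant to provide.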