From mboxrd@z Thu Jan 1 00:00:00 1970 From: Florian Fainelli Subject: Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()" Date: Thu, 31 Aug 2017 10:53:37 -0700 Message-ID: References: <1504140569-2063-1-git-send-email-f.fainelli@gmail.com> <931bf454-81ff-94dc-82e6-bc2b889bd43a@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: Marc Gonzalez , netdev , Geert Uytterhoeven , David Miller , Andrew Lunn , Mans Rullgard To: Mason , David Daney Return-path: Received: from mail-qt0-f196.google.com ([209.85.216.196]:33463 "EHLO mail-qt0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751022AbdHaRxr (ORCPT ); Thu, 31 Aug 2017 13:53:47 -0400 Received: by mail-qt0-f196.google.com with SMTP id h15so96943qta.0 for ; Thu, 31 Aug 2017 10:53:47 -0700 (PDT) In-Reply-To: Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 08/31/2017 10:49 AM, Mason wrote: > On 31/08/2017 18:57, Florian Fainelli wrote: >> And the race is between phy_detach() setting phydev->attached_dev = NULL >> and phy_state_machine() running in PHY_HALTED state and calling >> netif_carrier_off(). > > I must be missing something. > (Since a thread cannot race against itself.) > > phy_disconnect calls phy_stop_machine which > 1) stops the work queue from running in a separate thread > 2) calls phy_state_machine *synchronously* > which runs the PHY_HALTED case with everything well-defined > end of phy_stop_machine > > phy_disconnect only then calls phy_detach() > which makes future calls of phy_state_machine perilous. > > This all happens in the same thread, so I'm not yet > seeing where the race happens? The race is as described in David's earlier email, so let's recap: Thread 1 Thread 2 phy_disconnect() phy_stop_interrupts() phy_stop_machine() phy_state_machine() -> queue_delayed_work() phy_detach() phy_state_machine() -> netif_carrier_off() If phy_detach() finishes earlier than the workqueue had a chance to be scheduled and process PHY_HALTED again, then we trigger the NULL pointer de-reference. workqueues are not tasklets, the CPU scheduling them gets no guarantee they will run on the same CPU. -- Florian