netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] igb: Fix watchdog_task race with shutdown
@ 2025-04-28 11:54 Ian Ray
  2025-04-29 15:20 ` Simon Horman
  0 siblings, 1 reply; 3+ messages in thread
From: Ian Ray @ 2025-04-28 11:54 UTC (permalink / raw)
  To: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: brian.ruley, Ian Ray, intel-wired-lan, netdev, linux-kernel

A rare [1] race condition is observed between the igb_watchdog_task and
shutdown on a dual-core i.MX6 based system with two I210 controllers.

Using printk, the igb_watchdog_task is hung in igb_read_phy_reg because
__igb_shutdown has already called __igb_close.

Fix this by locking in igb_watchdog_task (in the same way as is done in
igb_reset_task).

reboot             kworker

__igb_shutdown
  rtnl_lock
  __igb_close
  :                igb_watchdog_task
  :                :
  :                igb_read_phy_reg (hung)
  rtnl_unlock

[1] Note that this is easier to reproduce with 'initcall_debug' logging
and additional and printk logging in igb_main.

Signed-off-by: Ian Ray <ian.ray@gehealthcare.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c646c71915f0..a4910e565a3f 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5544,6 +5544,8 @@ static void igb_watchdog_task(struct work_struct *work)
 	u32 connsw;
 	u16 phy_data, retry_count = 20;
 
+	rtnl_lock();
+
 	link = igb_has_link(adapter);
 
 	if (adapter->flags & IGB_FLAG_NEED_LINK_UPDATE) {
@@ -5680,7 +5682,7 @@ static void igb_watchdog_task(struct work_struct *work)
 				if (adapter->flags & IGB_FLAG_MEDIA_RESET) {
 					schedule_work(&adapter->reset_task);
 					/* return immediately */
-					return;
+					goto unlock;
 				}
 			}
 			pm_schedule_suspend(netdev->dev.parent,
@@ -5693,7 +5695,7 @@ static void igb_watchdog_task(struct work_struct *work)
 			if (adapter->flags & IGB_FLAG_MEDIA_RESET) {
 				schedule_work(&adapter->reset_task);
 				/* return immediately */
-				return;
+				goto unlock;
 			}
 		}
 	}
@@ -5714,7 +5716,7 @@ static void igb_watchdog_task(struct work_struct *work)
 				adapter->tx_timeout_count++;
 				schedule_work(&adapter->reset_task);
 				/* return immediately since reset is imminent */
-				return;
+				goto unlock;
 			}
 		}
 
@@ -5751,6 +5753,9 @@ static void igb_watchdog_task(struct work_struct *work)
 			mod_timer(&adapter->watchdog_timer,
 				  round_jiffies(jiffies + 2 * HZ));
 	}
+
+unlock:
+	rtnl_unlock();
 }
 
 enum latency_range {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] igb: Fix watchdog_task race with shutdown
  2025-04-28 11:54 [PATCH] igb: Fix watchdog_task race with shutdown Ian Ray
@ 2025-04-29 15:20 ` Simon Horman
  2025-04-30  6:13   ` Ian Ray
  0 siblings, 1 reply; 3+ messages in thread
From: Simon Horman @ 2025-04-29 15:20 UTC (permalink / raw)
  To: Ian Ray
  Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, brian.ruley,
	intel-wired-lan, netdev, linux-kernel,
	Toke Høiland-Jørgensen

+ Toke

On Mon, Apr 28, 2025 at 02:54:49PM +0300, Ian Ray wrote:
> A rare [1] race condition is observed between the igb_watchdog_task and
> shutdown on a dual-core i.MX6 based system with two I210 controllers.
> 
> Using printk, the igb_watchdog_task is hung in igb_read_phy_reg because
> __igb_shutdown has already called __igb_close.
> 
> Fix this by locking in igb_watchdog_task (in the same way as is done in
> igb_reset_task).
> 
> reboot             kworker
> 
> __igb_shutdown
>   rtnl_lock
>   __igb_close
>   :                igb_watchdog_task
>   :                :
>   :                igb_read_phy_reg (hung)
>   rtnl_unlock
> 
> [1] Note that this is easier to reproduce with 'initcall_debug' logging
> and additional and printk logging in igb_main.
> 
> Signed-off-by: Ian Ray <ian.ray@gehealthcare.com>

Hi Ian,

Thanks for your patch.

While I think that the simplicity of this approach may well be appropriate
as a fix for the problem described I do have a concern.

I am worried that taking RTNL each time the watchdog tasks will create
unnecessary lock contention. That may manifest in weird and wonderful ways
in future.  Maybe this patch doesn't make things materially worse in that
regard.  But it would be nice to have a plan to move away from using RTNL,
as is happening elsewhere.

...


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] igb: Fix watchdog_task race with shutdown
  2025-04-29 15:20 ` Simon Horman
@ 2025-04-30  6:13   ` Ian Ray
  0 siblings, 0 replies; 3+ messages in thread
From: Ian Ray @ 2025-04-30  6:13 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, brian.ruley,
	intel-wired-lan, netdev, linux-kernel,
	Toke Høiland-Jørgensen, ian.ray

On Tue, Apr 29, 2025 at 04:20:21PM +0100, Simon Horman wrote:
> + Toke
> 
> On Mon, Apr 28, 2025 at 02:54:49PM +0300, Ian Ray wrote:
> > A rare [1] race condition is observed between the igb_watchdog_task and
> > shutdown on a dual-core i.MX6 based system with two I210 controllers.
> >
> > Using printk, the igb_watchdog_task is hung in igb_read_phy_reg because
> > __igb_shutdown has already called __igb_close.
> >
> > Fix this by locking in igb_watchdog_task (in the same way as is done in
> > igb_reset_task).
> >
> > reboot             kworker
> >
> > __igb_shutdown
> >   rtnl_lock
> >   __igb_close
> >   :                igb_watchdog_task
> >   :                :
> >   :                igb_read_phy_reg (hung)
> >   rtnl_unlock
> >
> > [1] Note that this is easier to reproduce with 'initcall_debug' logging
> > and additional and printk logging in igb_main.
> >
> > Signed-off-by: Ian Ray <ian.ray@gehealthcare.com>
> 
> Hi Ian,
> 
> Thanks for your patch.
> 
> While I think that the simplicity of this approach may well be appropriate
> as a fix for the problem described I do have a concern.
> 
> I am worried that taking RTNL each time the watchdog tasks will create
> unnecessary lock contention. That may manifest in weird and wonderful ways
> in future.  Maybe this patch doesn't make things materially worse in that
> regard.  But it would be nice to have a plan to move away from using RTNL,
> as is happening elsewhere.
> 
> ...

Hi Simon,

Many thanks for the review.  I've been reflecting on the patch (and
discussing internally) and I think it would be better to model the
behaviour on igb_remove instead of igb_reset_task.  Meaning that the
timer should be deleted, and the work cancelled, after setting bit
IGB_DOWN.  This would mirror igb_up.  (And has the advantage of not
using the RTNL.)

(As you can probably tell) I am not very familiar with this subsystem,
but the modified proposal, below, works well in my testing.  I will
happily send a V2 if you think this is a better direction.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 291348505868..d4b905469cc2 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2173,10 +2173,14 @@ void igb_down(struct igb_adapter *adapter)
        u32 tctl, rctl;
        int i;

-       /* signal that we're down so the interrupt handler does not
-        * reschedule our watchdog timer
+       /* The watchdog timer may be rescheduled, so explicitly
+        * disable watchdog from being rescheduled.
         */
        set_bit(__IGB_DOWN, &adapter->state);
+       del_timer_sync(&adapter->watchdog_timer);
+       del_timer_sync(&adapter->phy_info_timer);
+
+       cancel_work_sync(&adapter->watchdog_task);

        /* disable receives in the hardware */
        rctl = rd32(E1000_RCTL);
@@ -2207,11 +2211,6 @@ void igb_down(struct igb_adapter *adapter)
                }
        }

-       del_timer_sync(&adapter->watchdog_timer);
-       del_timer_sync(&adapter->phy_info_timer);
-
-       cancel_work_sync(&adapter->watchdog_task);
-
        /* record the stats before reset*/
        spin_lock(&adapter->stats64_lock);
        igb_update_stats(adapter);

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-04-30  6:13 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-28 11:54 [PATCH] igb: Fix watchdog_task race with shutdown Ian Ray
2025-04-29 15:20 ` Simon Horman
2025-04-30  6:13   ` Ian Ray

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).