stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
@ 2025-07-10  8:50 John Ernberg
  2025-07-14 23:35 ` Jakub Kicinski
  2025-07-15 11:49 ` Oliver Neukum
  0 siblings, 2 replies; 10+ messages in thread
From: John Ernberg @ 2025-07-10  8:50 UTC (permalink / raw)
  To: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Ming Lei, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-kernel@vger.kernel.org, John Ernberg,
	stable@vger.kernel.org

Having a Gemalto Cinterion PLS83-W modem attached to USB and activating the
cellular data link would sometimes yield the following RCU stall, leading
to a system freeze:

    rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 33108 jiffies s: 201 root: 0x1/.
    rcu: blocking rcu_node structures (internal RCU debug):
    Sending NMI from CPU 1 to CPUs 0:
    NMI backtrace for cpu 0

    Call trace:
     arch_local_irq_enable+0x4/0x8
     local_bh_enable+0x18/0x20
     __netdev_alloc_skb+0x18c/0x1cc
     rx_submit+0x68/0x1f8 [usbnet]
     rx_alloc_submit+0x4c/0x74 [usbnet]
     usbnet_bh+0x1d8/0x218 [usbnet]
     usbnet_bh_tasklet+0x10/0x18 [usbnet]
     tasklet_action_common+0xa8/0x110
     tasklet_action+0x2c/0x34
     handle_softirqs+0x2cc/0x3a0
     __do_softirq+0x10/0x18
     ____do_softirq+0xc/0x14
     call_on_irq_stack+0x24/0x34
     do_softirq_own_stack+0x18/0x20
     __irq_exit_rcu+0xa8/0xb8
     irq_exit_rcu+0xc/0x30
     el1_interrupt+0x34/0x48
     el1h_64_irq_handler+0x14/0x1c
     el1h_64_irq+0x68/0x6c
     _raw_spin_unlock_irqrestore+0x38/0x48
     xhci_urb_dequeue+0x1ac/0x45c [xhci_hcd]
     unlink1+0xd4/0xdc [usbcore]
     usb_hcd_unlink_urb+0x70/0xb0 [usbcore]
     usb_unlink_urb+0x24/0x44 [usbcore]
     unlink_urbs.constprop.0.isra.0+0x64/0xa8 [usbnet]
     __handle_link_change+0x34/0x70 [usbnet]
     usbnet_deferred_kevent+0x1c0/0x320 [usbnet]
     process_scheduled_works+0x2d0/0x48c
     worker_thread+0x150/0x1dc
     kthread+0xd8/0xe8
     ret_from_fork+0x10/0x20

It turns out that during the link activation a LINK_CHANGE event is emitted
which causes the active RX URBs to be unlinked, while that is happening
rx_submit() may begin pushing new URBs to the queue being emptied.
Causing the unlink queue to never empty.

Use the same approach as commit 43daa96b166c ("usbnet: Stop RX Q on MTU
change") and pause the RX queue while unlinking the URBs on LINK_CHANGE
as well.

Fixes: 4b49f58fff00 ("usbnet: handle link change")
Cc: stable@vger.kernel.org
Signed-off-by: John Ernberg <john.ernberg@actia.se>

---

Tested on 6.12.20 and forward ported.
---
 drivers/net/usb/usbnet.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index c04e715a4c2a..156f0e85a135 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1115,7 +1115,9 @@ static void __handle_link_change(struct usbnet *dev)
 
 	if (!netif_carrier_ok(dev->net)) {
 		/* kill URBs for reading packets to save bus bandwidth */
+		usbnet_pause_rx(dev);
 		unlink_urbs(dev, &dev->rxq);
+		usbnet_resume_rx(dev);
 
 		/*
 		 * tx_timeout will unlink URBs for sending packets and
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-10  8:50 [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event John Ernberg
@ 2025-07-14 23:35 ` Jakub Kicinski
  2025-07-15  7:15   ` John Ernberg
  2025-07-15 11:49 ` Oliver Neukum
  1 sibling, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-07-14 23:35 UTC (permalink / raw)
  To: John Ernberg
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

On Thu, 10 Jul 2025 08:50:40 +0000 John Ernberg wrote:
> Having a Gemalto Cinterion PLS83-W modem attached to USB and activating the
> cellular data link would sometimes yield the following RCU stall, leading
> to a system freeze:

Do you know which sub-driver it's using?
I'm worried that this is still racy.
Since usbnet_bh checks if carrier is ok and __handle_link_change()
checks the opposite something must be out of sync if both run.
Most likely something restored the carrier while we're still handling
the previous carrier loss.
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-14 23:35 ` Jakub Kicinski
@ 2025-07-15  7:15   ` John Ernberg
  2025-07-15 13:54     ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: John Ernberg @ 2025-07-15  7:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

Hi Jakub,

On 7/15/25 1:35 AM, Jakub Kicinski wrote:
> On Thu, 10 Jul 2025 08:50:40 +0000 John Ernberg wrote:
>> Having a Gemalto Cinterion PLS83-W modem attached to USB and activating the
>> cellular data link would sometimes yield the following RCU stall, leading
>> to a system freeze:
> 
> Do you know which sub-driver it's using?

The modem uses cdc_ether (and option for the TTYs).

> I'm worried that this is still racy.
> Since usbnet_bh checks if carrier is ok and __handle_link_change()
> checks the opposite something must be out of sync if both run.
> Most likely something restored the carrier while we're still handling
> the previous carrier loss.

There could definitely be other factors, I'll try to dig some in 
cdc_ether and see if something there could be causing problems for the 
usbnet core.
I honestly kinda stopped digging when I found unlink_urbs() being 
wrapped with a pause/unpause at another place tied to a commit seeing a 
similar issue.

Thanks! // John Ernberg

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-10  8:50 [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event John Ernberg
  2025-07-14 23:35 ` Jakub Kicinski
@ 2025-07-15 11:49 ` Oliver Neukum
  1 sibling, 0 replies; 10+ messages in thread
From: Oliver Neukum @ 2025-07-15 11:49 UTC (permalink / raw)
  To: John Ernberg, Oliver Neukum, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Ming Lei, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On 10.07.25 10:50, John Ernberg wrote:
> Having a Gemalto Cinterion PLS83-W modem attached to USB and activating the
> cellular data link would sometimes yield the following RCU stall, leading
> to a system freeze:

[..]
   
>   	if (!netif_carrier_ok(dev->net)) {
>   		/* kill URBs for reading packets to save bus bandwidth */
> +		usbnet_pause_rx(dev);
>   		unlink_urbs(dev, &dev->rxq);
> +		usbnet_resume_rx(dev);

This will schedule work.
Do you really want that?

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-15  7:15   ` John Ernberg
@ 2025-07-15 13:54     ` Jakub Kicinski
  2025-07-16 14:54       ` John Ernberg
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-07-15 13:54 UTC (permalink / raw)
  To: John Ernberg
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

On Tue, 15 Jul 2025 07:15:51 +0000 John Ernberg wrote:
> > I'm worried that this is still racy.
> > Since usbnet_bh checks if carrier is ok and __handle_link_change()
> > checks the opposite something must be out of sync if both run.
> > Most likely something restored the carrier while we're still handling
> > the previous carrier loss.  
> 
> There could definitely be other factors, I'll try to dig some in 
> cdc_ether and see if something there could be causing problems for the 
> usbnet core.
> I honestly kinda stopped digging when I found unlink_urbs() being 
> wrapped with a pause/unpause at another place tied to a commit seeing a 
> similar issue.

Looking at cdc_ether:

static void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
[...]
        if (event->wValue &&                                                    
            netif_carrier_ok(dev->net))                                         
                netif_carrier_off(dev->net);  

This looks sus. Is the Gemalto Cinterion PLS83-W ZTE based?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-15 13:54     ` Jakub Kicinski
@ 2025-07-16 14:54       ` John Ernberg
  2025-07-16 21:39         ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: John Ernberg @ 2025-07-16 14:54 UTC (permalink / raw)
  To: Jakub Kicinski, Oliver Neukum
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Paolo Abeni,
	Ming Lei, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

Hi Jakub, Oliver,

On 7/15/25 3:54 PM, Jakub Kicinski wrote:
> On Tue, 15 Jul 2025 07:15:51 +0000 John Ernberg wrote:
>>> I'm worried that this is still racy.
>>> Since usbnet_bh checks if carrier is ok and __handle_link_change()
>>> checks the opposite something must be out of sync if both run.
>>> Most likely something restored the carrier while we're still handling
>>> the previous carrier loss.
>>
>> There could definitely be other factors, I'll try to dig some in
>> cdc_ether and see if something there could be causing problems for the
>> usbnet core.
>> I honestly kinda stopped digging when I found unlink_urbs() being
>> wrapped with a pause/unpause at another place tied to a commit seeing a
>> similar issue.
> 
> Looking at cdc_ether:
> 
> static void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
> [...]
>          if (event->wValue &&
>              netif_carrier_ok(dev->net))
>                  netif_carrier_off(dev->net);
> 
> This looks sus. Is the Gemalto Cinterion PLS83-W ZTE based?
> 

The modem isn't ZTE-based, but I did spot that function as well when 
checking cdc_ether.

The modem is advertising USB High Speed with a bInterval of 5, which in 
usbnet gets bumped to 7 to avoid too aggressive polling, if I read that 
code right.

Through some rudimentary logging with the inlined patch on 6.12.20 (mail 
client mangled):

-----------------8<----------------------

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index a6469235d904..7b4827d2c797 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -418,8 +418,10 @@ void usbnet_cdc_status(struct usbnet *dev, struct 
urb *urb)
         case USB_CDC_NOTIFY_NETWORK_CONNECTION:
                 netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
                           event->wValue ? "on" : "off");
-               if (netif_carrier_ok(dev->net) != !!event->wValue)
+               if (netif_carrier_ok(dev->net) != !!event->wValue) {
+                       netdev_err(dev->net, "network connection %d\n", 
(int)event->wValue);
                         usbnet_link_change(dev, !!event->wValue, 0);
+               }
                 break;
         case USB_CDC_NOTIFY_SPEED_CHANGE:       /* tx/rx rates */
                 netif_dbg(dev, timer, dev->net, "CDC: speed change (len 
%d)\n",
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 1e99a6ef4945..dc3e3d87b4f7 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -721,6 +721,7 @@ static int unlink_urbs (struct usbnet *dev, struct 
sk_buff_head *q)
         struct sk_buff          *skb;
         int                     count = 0;

+       netdev_err(dev->net, "unlink urb start: %d devflags=%lx\n", 
skb_queue_len_lockless(q), dev->flags);
         spin_lock_irqsave (&q->lock, flags);
         while (!skb_queue_empty(q)) {
                 struct skb_data         *entry;
@@ -757,6 +758,7 @@ static int unlink_urbs (struct usbnet *dev, struct 
sk_buff_head *q)
                 spin_lock_irqsave(&q->lock, flags);
         }
         spin_unlock_irqrestore (&q->lock, flags);
+       netdev_err(dev->net, "unlink urb counted %d\n", count);
         return count;
  }

-----------------8<----------------------

I ended up with the following log:

[   23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
[   23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
[   25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
[   25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
[   51.429081] rcu:     0-....: (6499 ticks this GP) 
idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
[   51.439717] rcu:              hardirqs   softirqs   csw/system
[   51.445897] rcu:      number:    62096      59017            0
[   51.452107] rcu:     cputime:        0      11397         1470   ==> 
12996(ms)
[   51.459852] rcu:     (t=6500 jiffies g=2397 q=663 ncpus=2)

 From a USB capture where the stall didn't happen I can see:
* A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
* Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the 
WWAN interface is turned on by the modem.
* Followed by a Disconnected in the next USB INTR poll.
* Followed by a Connected in the next USB INTR poll.
(I'm not sure if I can achieve a different timing with enough captures 
or a faster system)

Which makes the off and on LINK_CHANGE events race on our system (ARM64 
based, iMX8QXP) as they cannot be handled fast enough. Nothing stops 
usbnet_link_change() from being called while the deferred work is running.

As Oliver points out usbnet_resume_rx() causes scheduling which seems 
unnecessary or maybe even inappropriate for all cases except when the 
carrier was turned on during the race.

I gave the ZTE modem quirk a go anyway, despite the comment explaining a 
different situation than what I am seeing, and it has no observable 
effect on this RCU stall.

Currently drawing a blank on what the correct fix would be.

Best regards // John Ernberg

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-16 14:54       ` John Ernberg
@ 2025-07-16 21:39         ` Jakub Kicinski
  2025-07-18  9:07           ` John Ernberg
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-07-16 21:39 UTC (permalink / raw)
  To: John Ernberg
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

On Wed, 16 Jul 2025 14:54:46 +0000 John Ernberg wrote:
> I ended up with the following log:
> 
> [   23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
> [   23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
> [   23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
> [   25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
> [   25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
> [   25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
> [   25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
> [   51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
> [   51.429081] rcu:     0-....: (6499 ticks this GP) 
> idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
> [   51.439717] rcu:              hardirqs   softirqs   csw/system
> [   51.445897] rcu:      number:    62096      59017            0
> [   51.452107] rcu:     cputime:        0      11397         1470   ==> 
> 12996(ms)
> [   51.459852] rcu:     (t=6500 jiffies g=2397 q=663 ncpus=2)
> 
>  From a USB capture where the stall didn't happen I can see:
> * A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
> * Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the 
> WWAN interface is turned on by the modem.
> * Followed by a Disconnected in the next USB INTR poll.
> * Followed by a Connected in the next USB INTR poll.
> (I'm not sure if I can achieve a different timing with enough captures 
> or a faster system)
> 
> Which makes the off and on LINK_CHANGE events race on our system (ARM64 
> based, iMX8QXP) as they cannot be handled fast enough. Nothing stops 
> usbnet_link_change() from being called while the deferred work is running.
> 
> As Oliver points out usbnet_resume_rx() causes scheduling which seems 
> unnecessary or maybe even inappropriate for all cases except when the 
> carrier was turned on during the race.
> 
> I gave the ZTE modem quirk a go anyway, despite the comment explaining a 
> different situation than what I am seeing, and it has no observable 
> effect on this RCU stall.
> 
> Currently drawing a blank on what the correct fix would be.

Thanks for the analysis, I think I may have misread the code.
What I was saying is that we are restoring the carrier while
we are still processing the previous carrier off event in
the workqueue. My thinking was that if we deferred the
netif_carrier_on() to the workqueue this race couldn't happen.

usbnet_bh() already checks netif_carrier_ok() - we're kinda duplicating
the carrier state with this RX_PAUSED workaround.

I don't feel strongly about this, but deferring the carrier_on()
the the workqueue would be a cleaner solution IMO.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-16 21:39         ` Jakub Kicinski
@ 2025-07-18  9:07           ` John Ernberg
  2025-07-18 23:18             ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: John Ernberg @ 2025-07-18  9:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

Hi Jakub,

On 7/16/25 11:39 PM, Jakub Kicinski wrote:
> On Wed, 16 Jul 2025 14:54:46 +0000 John Ernberg wrote:
>> I ended up with the following log:
>>
>> [   23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
>> [   23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
>> [   23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
>> [   25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
>> [   25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
>> [   25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
>> [   25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
>> [   51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
>> [   51.429081] rcu:     0-....: (6499 ticks this GP)
>> idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
>> [   51.439717] rcu:              hardirqs   softirqs   csw/system
>> [   51.445897] rcu:      number:    62096      59017            0
>> [   51.452107] rcu:     cputime:        0      11397         1470   ==>
>> 12996(ms)
>> [   51.459852] rcu:     (t=6500 jiffies g=2397 q=663 ncpus=2)
>>
>>   From a USB capture where the stall didn't happen I can see:
>> * A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
>> * Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the
>> WWAN interface is turned on by the modem.
>> * Followed by a Disconnected in the next USB INTR poll.
>> * Followed by a Connected in the next USB INTR poll.
>> (I'm not sure if I can achieve a different timing with enough captures
>> or a faster system)
>>
>> Which makes the off and on LINK_CHANGE events race on our system (ARM64
>> based, iMX8QXP) as they cannot be handled fast enough. Nothing stops
>> usbnet_link_change() from being called while the deferred work is running.
>>
>> As Oliver points out usbnet_resume_rx() causes scheduling which seems
>> unnecessary or maybe even inappropriate for all cases except when the
>> carrier was turned on during the race.
>>
>> I gave the ZTE modem quirk a go anyway, despite the comment explaining a
>> different situation than what I am seeing, and it has no observable
>> effect on this RCU stall.
>>
>> Currently drawing a blank on what the correct fix would be.
> 
> Thanks for the analysis, I think I may have misread the code.
> What I was saying is that we are restoring the carrier while
> we are still processing the previous carrier off event in
> the workqueue. My thinking was that if we deferred the
> netif_carrier_on() to the workqueue this race couldn't happen.
> 
> usbnet_bh() already checks netif_carrier_ok() - we're kinda duplicating
> the carrier state with this RX_PAUSED workaround.
> 
> I don't feel strongly about this, but deferring the carrier_on()
> the the workqueue would be a cleaner solution IMO.
> 

I've been thinking about this idea, but I'm concerned for the opposite 
direction. I cannot think of a way to fully guarantee that the carrier 
isn't turned on again incorrectly if an off gets queued.

The most I came up with was adding an extra flag bit to set carrier on, 
and then test_and_clear_bit() it in the __handle_link_change() function.
And also clear_bit() in the usbnet_link_change() function if an off 
arrives. I cannot convince myself that there isn't a way for that to go 
sideways. But perhaps that would be robust enough?

I've also considered the possibility of just not re-submitting the INTR 
poll URB until the last one was fully processed when handling a link 
change. But that might cause havoc with ASIX and Sierra devices as they 
are calling usbnet_link_change() in other ways than through the 
.status-callback. I don't have any of these devices so I cannot test 
them for regressions. So this path feels quite dangerous.
With a sub-driver property to enable this behavior it might work out?

Thanks! // John Ernberg

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-18  9:07           ` John Ernberg
@ 2025-07-18 23:18             ` Jakub Kicinski
  2025-07-23  9:29               ` John Ernberg
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-07-18 23:18 UTC (permalink / raw)
  To: John Ernberg
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

On Fri, 18 Jul 2025 09:07:26 +0000 John Ernberg wrote:
> > Thanks for the analysis, I think I may have misread the code.
> > What I was saying is that we are restoring the carrier while
> > we are still processing the previous carrier off event in
> > the workqueue. My thinking was that if we deferred the
> > netif_carrier_on() to the workqueue this race couldn't happen.
> > 
> > usbnet_bh() already checks netif_carrier_ok() - we're kinda duplicating
> > the carrier state with this RX_PAUSED workaround.
> > 
> > I don't feel strongly about this, but deferring the carrier_on()
> > the the workqueue would be a cleaner solution IMO.
> >   
> 
> I've been thinking about this idea, but I'm concerned for the opposite 
> direction. I cannot think of a way to fully guarantee that the carrier 
> isn't turned on again incorrectly if an off gets queued.
> 
> The most I came up with was adding an extra flag bit to set carrier on, 
> and then test_and_clear_bit() it in the __handle_link_change() function.
> And also clear_bit() in the usbnet_link_change() function if an off 
> arrives. I cannot convince myself that there isn't a way for that to go 
> sideways. But perhaps that would be robust enough?

I think it should be robust enough.. Unless my grep skills are failing
me - no drivers which call usbnet_link_change() twiddle the link state
directly.

Give it a go, if you think your initial patch is cleaner -- it's fine.

> I've also considered the possibility of just not re-submitting the INTR 
> poll URB until the last one was fully processed when handling a link 
> change. But that might cause havoc with ASIX and Sierra devices as they 
> are calling usbnet_link_change() in other ways than through the 
> .status-callback. I don't have any of these devices so I cannot test 
> them for regressions. So this path feels quite dangerous.
> With a sub-driver property to enable this behavior it might work out?

Yeah, that seems more involved.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  2025-07-18 23:18             ` Jakub Kicinski
@ 2025-07-23  9:29               ` John Ernberg
  0 siblings, 0 replies; 10+ messages in thread
From: John Ernberg @ 2025-07-23  9:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Oliver Neukum, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Ming Lei, netdev@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org

Hi Jakub,

On Fri, Jul 18, 2025 at 04:18:25PM -0700, Jakub Kicinski wrote:
> On Fri, 18 Jul 2025 09:07:26 +0000 John Ernberg wrote:
> > > Thanks for the analysis, I think I may have misread the code.
> > > What I was saying is that we are restoring the carrier while
> > > we are still processing the previous carrier off event in
> > > the workqueue. My thinking was that if we deferred the
> > > netif_carrier_on() to the workqueue this race couldn't happen.
> > > 
> > > usbnet_bh() already checks netif_carrier_ok() - we're kinda duplicating
> > > the carrier state with this RX_PAUSED workaround.
> > > 
> > > I don't feel strongly about this, but deferring the carrier_on()
> > > the the workqueue would be a cleaner solution IMO.
> > >   
> > 
> > I've been thinking about this idea, but I'm concerned for the opposite 
> > direction. I cannot think of a way to fully guarantee that the carrier 
> > isn't turned on again incorrectly if an off gets queued.
> > 
> > The most I came up with was adding an extra flag bit to set carrier on, 
> > and then test_and_clear_bit() it in the __handle_link_change() function.
> > And also clear_bit() in the usbnet_link_change() function if an off 
> > arrives. I cannot convince myself that there isn't a way for that to go 
> > sideways. But perhaps that would be robust enough?
> 
> I think it should be robust enough.. Unless my grep skills are failing
> me - no drivers which call usbnet_link_change() twiddle the link state
> directly.
> 
> Give it a go, if you think your initial patch is cleaner -- it's fine.
> 

Apologies for the delay, I was stuck in a higher priority issue.

I've tested this approach and it looks promising. Will send this approach
as a v2 later today.

Thank you for the guidance, very much appreciated.

Best regards // John Ernberg

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-07-23  9:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-10  8:50 [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE event John Ernberg
2025-07-14 23:35 ` Jakub Kicinski
2025-07-15  7:15   ` John Ernberg
2025-07-15 13:54     ` Jakub Kicinski
2025-07-16 14:54       ` John Ernberg
2025-07-16 21:39         ` Jakub Kicinski
2025-07-18  9:07           ` John Ernberg
2025-07-18 23:18             ` Jakub Kicinski
2025-07-23  9:29               ` John Ernberg
2025-07-15 11:49 ` Oliver Neukum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).