* [PATCH 0/2] thunderbolt: fix wedge under sustained tbnet load on AM4 and AM5
@ 2026-04-28 1:55 Benjamin Berman
2026-04-28 1:55 ` [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete() Benjamin Berman
2026-04-28 1:55 ` [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load Benjamin Berman
0 siblings, 2 replies; 6+ messages in thread
From: Benjamin Berman @ 2026-04-28 1:55 UTC (permalink / raw)
To: Andreas Noever, Mika Westerberg, Yehezkel Bernat
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
Greetings Thunderbolt maintainers,
These driver patches were tested by me, Benjamin Berman, a software
developer, but they were authored by a coding agent that had access
to the hardware and ran the patches against it.
These patches fix Thunderbolt networking between Thunderbolt 3 and
Thunderbolt 4 (USB4) hosts on AM4 and AM5. I observed the issues when
running NCCL across a Thunderbolt daisy chain: the connection would
drop abruptly, and performance was poorer than expected. In each case
I also had to update the NVM on the AM4 TB3 controllers by exotic
methods; AM5 boards generally ship the Thunderbolt controller NVM as
part of their UEFI updates.
Please advise on next steps for improving the patches. I can also
make my testing environment available, since it has a collection of
random but useful Thunderbolt hardware.
Below is the generatively-authored explanation of the patch, and the
patch itself:
---
Two changes:
1. drivers/thunderbolt/nhi.c — tb_ring_poll_complete() gates the
unmask on @start_poll rather than @running. Under load on NHIs
with several rings in NAPI poll, a race with __ring_interrupt()'s
unconditional mask leaves the ring masked: MSI-X stops, NAPI is
not rescheduled, carrier stays up, no driver event fires. On NHIs
without QUIRK_AUTO_CLEAR_INT, stale REG_RING_NOTIFY_BASE state
blocks MSI-X re-arm. The patch gates on @running, adds a posted-
write barrier, and clears the ring's pending bit before re-enable.
2. drivers/net/thunderbolt/main.c — TBNET_RING_SIZE=256 and the
netif_napi_add() weight of 64 produce ~1 % rx_missed_errors on a
TB4 transit under sustained tbnet bulk traffic. The patch raises
ring size to 2048 and the NAPI weight to 256.
Hardware tested:
ASRock X570 Phantom Gaming-ITX/TB3 (AM4), Intel JHL7540 2C TB3
controller, NVM 50.0
ASUS ROG STRIX X670E-I GAMING WIFI (AM5), Maple Ridge 4C TB4
controller, NVM 43.83
Monoprice USB4 Gen 3 40 Gb/s passive cables
Linux 6.17.0-22-generic (Ubuntu HWE)
Workload: NCCL 2.28.9 all-reduce over tb-lo, NCCL_ALGO=Tree,
NCCL_PROTO=Simple, three ranks. Pre-patch, the connection wedges in
under 1 GB of transfer. Post-patch, a 192 GB run (3000 iterations
of a 64 MiB all-reduce) completes with mask/unmask counters
balanced and rx_missed_errors under 0.005 %.
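For anyone reproducing these numbers, the counters cited here can be read with standard tooling (the interface name tb-lo is taken from the workload description above; exact counter availability varies by kernel and driver):

```shell
# Per-interface error counters, including rx_missed_errors, via rtnetlink.
ip -s -s link show dev tb-lo
# Driver-level statistics, where the driver implements ethtool stats ops.
ethtool -S tb-lo
```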
Built clean against linux.git commit 3b3bea6d4b9c.
Benjamin Berman (2):
thunderbolt: drop start_poll guard in tb_ring_poll_complete()
net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained
load
drivers/net/thunderbolt/main.c | 4 ++--
drivers/thunderbolt/nhi.c | 22 +++++++++++++++++++---
2 files changed, 21 insertions(+), 5 deletions(-)
base-commit: 3b3bea6d4b9c162f9e555905d96b8c1da67ecd5b
--
2.43.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete()
2026-04-28 1:55 [PATCH 0/2] thunderbolt: fix wedge under sustained tbnet load on AM4 and AM5 Benjamin Berman
@ 2026-04-28 1:55 ` Benjamin Berman
2026-04-28 7:33 ` Mika Westerberg
2026-04-28 1:55 ` [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load Benjamin Berman
1 sibling, 1 reply; 6+ messages in thread
From: Benjamin Berman @ 2026-04-28 1:55 UTC (permalink / raw)
To: Andreas Noever, Mika Westerberg, Yehezkel Bernat
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
Under concurrent load on a single NHI with several rings simultaneously
in NAPI poll (e.g. a Maple Ridge TB4 transit forwarding tbnet traffic
between two peers), one ring's interrupt enable bit in
REG_RING_INTERRUPT_BASE can stay cleared. MSI-X stops for that ring,
NAPI is never rescheduled, but carrier is reported up and no driver
event fires. The ring stays masked until thunderbolt_net is reloaded.
tb_ring_poll_complete() gated the unmask on @start_poll:
if (ring->start_poll)
__ring_interrupt_mask(ring, false);
while the ISR path masks unconditionally via __ring_interrupt(). In a
window where @start_poll is observed as NULL by the unmask path while
the paired mask persists, the ring is left permanently masked.
Gate on @running instead and add an ioread32() barrier so the posted
enable reaches the device before the spinlock is dropped.
On NHIs without QUIRK_AUTO_CLEAR_INT a second issue compounds the
first: stale pending status in REG_RING_NOTIFY_BASE can prevent the
hardware from re-arming its MSI-X generator when the ring is
re-enabled. Clear the ring's bit in REG_RING_INT_CLEAR before setting
the enable bit, mirroring what ring_msix() already does at ISR entry.
Verified on a Maple Ridge 4C transit and two TB3 Titan Ridge endpoints
running NCCL all-reduce over tb-lo: pre-patch the chain wedges in
under 1 GB; post-patch a 192 GB run (3000 iterations of a 64 MiB
all-reduce) completes with mask/unmask counters balanced.
Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
---
drivers/thunderbolt/nhi.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index 2bb2e79ca..bba45ec36 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -389,10 +389,24 @@ static void __ring_interrupt_mask(struct tb_ring *ring, bool mask)
u32 val;
val = ioread32(ring->nhi->iobase + reg);
- if (mask)
+ if (mask) {
val &= ~BIT(bit);
- else
+ } else {
+ if (!(ring->nhi->quirks & QUIRK_AUTO_CLEAR_INT)) {
+ int cbit = ring_interrupt_index(ring) & 31;
+
+ if (ring->is_tx)
+ iowrite32(BIT(cbit),
+ ring->nhi->iobase +
+ REG_RING_INT_CLEAR);
+ else
+ iowrite32(BIT(cbit),
+ ring->nhi->iobase +
+ REG_RING_INT_CLEAR +
+ 4 * (ring->nhi->hop_count / 32));
+ }
val |= BIT(bit);
+ }
iowrite32(val, ring->nhi->iobase + reg);
}
@@ -423,8 +437,10 @@ void tb_ring_poll_complete(struct tb_ring *ring)
spin_lock_irqsave(&ring->nhi->lock, flags);
spin_lock(&ring->lock);
- if (ring->start_poll)
+ if (ring->running) {
__ring_interrupt_mask(ring, false);
+ (void)ioread32(ring->nhi->iobase + REG_RING_INTERRUPT_BASE);
+ }
spin_unlock(&ring->lock);
spin_unlock_irqrestore(&ring->nhi->lock, flags);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete()
2026-04-28 1:55 ` [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete() Benjamin Berman
@ 2026-04-28 7:33 ` Mika Westerberg
0 siblings, 0 replies; 6+ messages in thread
From: Mika Westerberg @ 2026-04-28 7:33 UTC (permalink / raw)
To: Benjamin Berman
Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, Andrew Lunn,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-usb, netdev, linux-kernel
Hi,
On Mon, Apr 27, 2026 at 06:55:20PM -0700, Benjamin Berman wrote:
> Under concurrent load on a single NHI with several rings simultaneously
> in NAPI poll (e.g. a Maple Ridge TB4 transit forwarding tbnet traffic
> between two peers), one ring's interrupt enable bit in
> REG_RING_INTERRUPT_BASE can stay cleared. MSI-X stops for that ring,
> NAPI is never rescheduled, but carrier is reported up and no driver
> event fires. The ring stays masked until thunderbolt_net is reloaded.
>
> tb_ring_poll_complete() gated the unmask on @start_poll:
>
> if (ring->start_poll)
> __ring_interrupt_mask(ring, false);
>
> while the ISR path masks unconditionally via __ring_interrupt(). In a
> window where @start_poll is observed as NULL by the unmask path while
> the paired mask persists, the ring is left permanently masked.
>
> Gate on @running instead and add an ioread32() barrier so the posted
> enable reaches the device before the spinlock is dropped.
>
> On NHIs without QUIRK_AUTO_CLEAR_INT a second issue compounds the
> first: stale pending status in REG_RING_NOTIFY_BASE can prevent the
> hardware from re-arming its MSI-X generator when the ring is
> re-enabled. Clear the ring's bit in REG_RING_INT_CLEAR before setting
> the enable bit, mirroring what ring_msix() already does at ISR entry.
>
> Verified on a Maple Ridge 4C transit and two TB3 Titan Ridge endpoints
> running NCCL all-reduce over tb-lo: pre-patch the chain wedges in
> under 1 GB; post-patch a 192 GB run (3000 iterations of a 64 MiB
> all-reduce) completes with mask/unmask counters balanced.
I think this makes sense.
I do have a few comments about the code itself. See below.
> Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
> Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> ---
> drivers/thunderbolt/nhi.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> index 2bb2e79ca..bba45ec36 100644
> --- a/drivers/thunderbolt/nhi.c
> +++ b/drivers/thunderbolt/nhi.c
> @@ -389,10 +389,24 @@ static void __ring_interrupt_mask(struct tb_ring *ring, bool mask)
> u32 val;
>
> val = ioread32(ring->nhi->iobase + reg);
> - if (mask)
> + if (mask) {
> val &= ~BIT(bit);
> - else
> + } else {
> + if (!(ring->nhi->quirks & QUIRK_AUTO_CLEAR_INT)) {
> + int cbit = ring_interrupt_index(ring) & 31;
> +
> + if (ring->is_tx)
> + iowrite32(BIT(cbit),
> + ring->nhi->iobase +
> + REG_RING_INT_CLEAR);
> + else
> + iowrite32(BIT(cbit),
> + ring->nhi->iobase +
> + REG_RING_INT_CLEAR +
> + 4 * (ring->nhi->hop_count / 32));
> + }
This should be a separate helper function, ring_interrupt_clear() or
so. We actually have a function with that name, but it clears with a
bit too big a hammer for this. So I suggest reworking it with the
above code and using it here and also in nhi_disable_interrupts().
> val |= BIT(bit);
> + }
> iowrite32(val, ring->nhi->iobase + reg);
> }
>
> @@ -423,8 +437,10 @@ void tb_ring_poll_complete(struct tb_ring *ring)
>
> spin_lock_irqsave(&ring->nhi->lock, flags);
> spin_lock(&ring->lock);
> - if (ring->start_poll)
> + if (ring->running) {
> __ring_interrupt_mask(ring, false);
> + (void)ioread32(ring->nhi->iobase + REG_RING_INTERRUPT_BASE);
Drop the (void) cast but add a comment that this is for posted write.
> + }
> spin_unlock(&ring->lock);
> spin_unlock_irqrestore(&ring->nhi->lock, flags);
> }
> --
> 2.43.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load
2026-04-28 1:55 [PATCH 0/2] thunderbolt: fix wedge under sustained tbnet load on AM4 and AM5 Benjamin Berman
2026-04-28 1:55 ` [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete() Benjamin Berman
@ 2026-04-28 1:55 ` Benjamin Berman
2026-04-28 7:42 ` Mika Westerberg
1 sibling, 1 reply; 6+ messages in thread
From: Benjamin Berman @ 2026-04-28 1:55 UTC (permalink / raw)
To: Andreas Noever, Mika Westerberg, Yehezkel Bernat
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
The default TBNET_RING_SIZE of 256 and the NAPI_POLL_WEIGHT of 64
implicit in netif_napi_add() are too small for host-to-host Thunderbolt
networking under sustained bulk traffic. Running NCCL all-reduce over
tb-lo on a three-node chain (two TB3 endpoints plus a TB4 Maple Ridge
transit) produces rx_missed_errors at ~1 % of rx_packets on the transit
and ~0.6 % on the endpoints, with rx_packets stalling against a peer's
continuing tx_packets.
Raise TBNET_RING_SIZE to 2048 (8x) and use netif_napi_add_weight() with
a per-NAPI weight of 256 so tbnet_poll() drains more frames per softirq
invocation. With matching sysctls (net.core.netdev_budget=1024,
net.core.netdev_budget_usecs=8000) rx_missed_errors stays below 0.005 %
over a 192 GB all-reduce workload on the same hardware.
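For reference, the matching sysctls above can be applied like this (values copied from this commit message; persisting them under /etc/sysctl.d/ is left to the distribution):

```shell
# Raise the softirq packet budget to match the larger NAPI weight.
sysctl -w net.core.netdev_budget=1024
# Allow up to 8 ms of packet processing per net_rx_action() run.
sysctl -w net.core.netdev_budget_usecs=8000
```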
Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
---
drivers/net/thunderbolt/main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/thunderbolt/main.c b/drivers/net/thunderbolt/main.c
index 7aae5d915..3a096f7c5 100644
--- a/drivers/net/thunderbolt/main.c
+++ b/drivers/net/thunderbolt/main.c
@@ -31,7 +31,7 @@
#define TBNET_LOGIN_TIMEOUT 500
#define TBNET_LOGOUT_TIMEOUT 1000
-#define TBNET_RING_SIZE 256
+#define TBNET_RING_SIZE 2048
#define TBNET_LOGIN_RETRIES 60
#define TBNET_LOGOUT_RETRIES 10
#define TBNET_E2E BIT(0)
@@ -1383,7 +1383,7 @@ static int tbnet_probe(struct tb_service *svc, const struct tb_service_id *id)
dev->features = dev->hw_features | NETIF_F_HIGHDMA;
dev->hard_header_len += sizeof(struct thunderbolt_ip_frame_header);
- netif_napi_add(dev, &net->napi, tbnet_poll);
+ netif_napi_add_weight(dev, &net->napi, tbnet_poll, 256);
/* MTU range: 68 - 65522 */
dev->min_mtu = ETH_MIN_MTU;
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load
2026-04-28 1:55 ` [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load Benjamin Berman
@ 2026-04-28 7:42 ` Mika Westerberg
2026-04-28 12:54 ` Andrew Lunn
0 siblings, 1 reply; 6+ messages in thread
From: Mika Westerberg @ 2026-04-28 7:42 UTC (permalink / raw)
To: Benjamin Berman
Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, Andrew Lunn,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-usb, netdev, linux-kernel
On Mon, Apr 27, 2026 at 06:55:21PM -0700, Benjamin Berman wrote:
> The default TBNET_RING_SIZE of 256 and the NAPI_POLL_WEIGHT of 64
> implicit in netif_napi_add() are too small for host-to-host Thunderbolt
> networking under sustained bulk traffic. Running NCCL all-reduce over
> tb-lo on a three-node chain (two TB3 endpoints plus a TB4 Maple Ridge
> transit) produces rx_missed_errors at ~1 % of rx_packets on the transit
> and ~0.6 % on the endpoints, with rx_packets stalling against a peer's
> continuing tx_packets.
>
> Raise TBNET_RING_SIZE to 2048 (8x) and use netif_napi_add_weight() with
> a per-NAPI weight of 256 so tbnet_poll() drains more frames per softirq
> invocation. With matching sysctls (net.core.netdev_budget=1024,
> net.core.netdev_budget_usecs=8000) rx_missed_errors stays below 0.005 %
> over a 192 GB all-reduce workload on the same hardware.
>
> Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
> Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
For ring size I don't have any objections. The current ring size 256 is
arbitrary and at the time seemed reasonable.
For the poll weight there is the comment in netdevice.h:
/* Default NAPI poll() weight
* Device drivers are strongly advised to not use bigger value
*/
#define NAPI_POLL_WEIGHT 64
But if you see improvement using 256 here I'm fine with that unless the
network folks advise otherwise.
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load
2026-04-28 7:42 ` Mika Westerberg
@ 2026-04-28 12:54 ` Andrew Lunn
0 siblings, 0 replies; 6+ messages in thread
From: Andrew Lunn @ 2026-04-28 12:54 UTC (permalink / raw)
To: Mika Westerberg
Cc: Benjamin Berman, Andreas Noever, Mika Westerberg, Yehezkel Bernat,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
On Tue, Apr 28, 2026 at 09:42:53AM +0200, Mika Westerberg wrote:
> On Mon, Apr 27, 2026 at 06:55:21PM -0700, Benjamin Berman wrote:
> > The default TBNET_RING_SIZE of 256 and the NAPI_POLL_WEIGHT of 64
> > implicit in netif_napi_add() are too small for host-to-host Thunderbolt
> > networking under sustained bulk traffic. Running NCCL all-reduce over
> > tb-lo on a three-node chain (two TB3 endpoints plus a TB4 Maple Ridge
> > transit) produces rx_missed_errors at ~1 % of rx_packets on the transit
> > and ~0.6 % on the endpoints, with rx_packets stalling against a peer's
> > continuing tx_packets.
> >
> > Raise TBNET_RING_SIZE to 2048 (8x) and use netif_napi_add_weight() with
> > a per-NAPI weight of 256 so tbnet_poll() drains more frames per softirq
> > invocation. With matching sysctls (net.core.netdev_budget=1024,
> > net.core.netdev_budget_usecs=8000) rx_missed_errors stays below 0.005 %
> > over a 192 GB all-reduce workload on the same hardware.
> >
> > Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
> > Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> > Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
>
> For ring size I don't have any objections. The current ring size 256 is
> arbitrary and at the time seemed reasonable.
>
> For the poll weight there is the comment in netdevice.h:
>
> /* Default NAPI poll() weight
> * Device drivers are strongly advised to not use bigger value
> */
> #define NAPI_POLL_WEIGHT 64
>
> But if you see improvement using 256 here I'm fine with that unless the
> network folks advise otherwise.
I just did a quick sample of other drivers that change the NAPI
weight. Of the 10 I looked at, 9 reduced the weight; only one
increased it.
I would like the core netdev people to comment on this, before it is
accepted.
Questions which come to mind:
Why is the polling not happening frequently enough?
Is it frequently swapping between polling and interrupts?
Is there interrupt coalesce going on, and the coalesce time set too
high, so that by the time the interrupt fires the ring is full? Can
you play with ethtool -C?
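For anyone following up on that question, coalescing is inspected and tuned along these lines (tbnet may not implement the relevant ethtool ops, in which case the commands simply report that the operation is not supported; the rx-usecs value is a hypothetical example):

```shell
# Show the current interrupt coalescing parameters for the interface.
ethtool -c tb-lo
# Example: lower the RX coalesce time to 20 us to see whether the ring
# fills before the interrupt fires.
ethtool -C tb-lo rx-usecs 20
```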
Andrew
^ permalink raw reply [flat|nested] 6+ messages in thread