* [PATCH 0/2] thunderbolt: fix wedge under sustained tbnet load on AM4 and AM5
@ 2026-04-28 1:55 Benjamin Berman
2026-04-28 1:55 ` [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete() Benjamin Berman
2026-04-28 1:55 ` [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load Benjamin Berman
0 siblings, 2 replies; 6+ messages in thread
From: Benjamin Berman @ 2026-04-28 1:55 UTC (permalink / raw)
To: Andreas Noever, Mika Westerberg, Yehezkel Bernat
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
Greetings Thunderbolt maintainers,
These driver patches were tested by me, Benjamin Berman, a software
developer, but they were authored by a coding agent that had access
to the hardware and ran the patches against it.
These patches fix Thunderbolt networking between Thunderbolt 3 and
Thunderbolt 4 (USB4) hosts on AM4 and AM5. I observed the issues when
running NCCL across a Thunderbolt daisy chain: the connection would
drop abruptly, and performance was poorer than expected. In each case
I also had to update the NVM on the AM4 TB3 controllers by exotic
methods; AM5 boards generally ship the Thunderbolt controller NVM as
part of their UEFI updates.
Please advise on next steps for improving the patches. I can also
make my testing environment available, since it has a collection of
random but useful Thunderbolt hardware.
Below is the generatively-authored explanation of the patch, and the
patch itself:
---
Two changes:
1. drivers/thunderbolt/nhi.c — tb_ring_poll_complete() gates the
unmask on @start_poll rather than @running. Under load on NHIs
with several rings in NAPI poll, a race with __ring_interrupt()'s
unconditional mask leaves the ring masked: MSI-X stops, NAPI is
not rescheduled, carrier stays up, no driver event fires. On NHIs
without QUIRK_AUTO_CLEAR_INT, stale REG_RING_NOTIFY_BASE state
blocks MSI-X re-arm. The patch gates on @running, adds a posted-
write barrier, and clears the ring's pending bit before re-enable.
2. drivers/net/thunderbolt/main.c — TBNET_RING_SIZE=256 and the
netif_napi_add() weight of 64 produce ~1 % rx_missed_errors on a
TB4 transit under sustained tbnet bulk traffic. The patch raises
ring size to 2048 and the NAPI weight to 256.
Hardware tested:
ASRock X570 Phantom Gaming-ITX/TB3 (AM4), Intel JHL7540 2C TB3
controller, NVM 50.0
ASUS ROG STRIX X670E-I GAMING WIFI (AM5), Maple Ridge 4C TB4
controller, NVM 43.83
Monoprice USB4 Gen 3 40 Gb/s passive cables
Linux 6.17.0-22-generic (Ubuntu HWE)
Workload: NCCL 2.28.9 all-reduce over tb-lo, NCCL_ALGO=Tree,
NCCL_PROTO=Simple, three ranks. Pre-patch, the connection wedges in
under 1 GB of transfer. Post-patch, a 192 GB run (3000 iterations
of a 64 MiB all-reduce) completes with mask/unmask counters
balanced and rx_missed_errors under 0.005 %.
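For anyone reproducing these numbers, the counters cited here can be read with standard tooling (the interface name tb-lo is taken from the workload description above; exact counter availability varies by kernel and driver):

```shell
# Per-interface error counters, including rx_missed_errors, via rtnetlink.
ip -s -s link show dev tb-lo
# Driver-level statistics, where the driver implements ethtool stats ops.
ethtool -S tb-lo
```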
Built clean against linux.git commit 3b3bea6d4b9c.
Benjamin Berman (2):
thunderbolt: drop start_poll guard in tb_ring_poll_complete()
net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained
load
drivers/net/thunderbolt/main.c | 4 ++--
drivers/thunderbolt/nhi.c | 22 +++++++++++++++++++---
2 files changed, 21 insertions(+), 5 deletions(-)
base-commit: 3b3bea6d4b9c162f9e555905d96b8c1da67ecd5b
--
2.43.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete()
2026-04-28 1:55 [PATCH 0/2] thunderbolt: fix wedge under sustained tbnet load on AM4 and AM5 Benjamin Berman
@ 2026-04-28 1:55 ` Benjamin Berman
2026-04-28 7:33 ` Mika Westerberg
2026-04-28 1:55 ` [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load Benjamin Berman
1 sibling, 1 reply; 6+ messages in thread
From: Benjamin Berman @ 2026-04-28 1:55 UTC (permalink / raw)
To: Andreas Noever, Mika Westerberg, Yehezkel Bernat
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
Under concurrent load on a single NHI with several rings simultaneously
in NAPI poll (e.g. a Maple Ridge TB4 transit forwarding tbnet traffic
between two peers), one ring's interrupt enable bit in
REG_RING_INTERRUPT_BASE can stay cleared. MSI-X stops for that ring,
NAPI is never rescheduled, but carrier is reported up and no driver
event fires. The ring stays masked until thunderbolt_net is reloaded.
tb_ring_poll_complete() gated the unmask on @start_poll:
if (ring->start_poll)
__ring_interrupt_mask(ring, false);
while the ISR path masks unconditionally via __ring_interrupt(). In a
window where @start_poll is observed as NULL by the unmask path while
the paired mask persists, the ring is left permanently masked.
Gate on @running instead and add an ioread32() barrier so the posted
enable reaches the device before the spinlock is dropped.
On NHIs without QUIRK_AUTO_CLEAR_INT a second issue compounds the
first: stale pending status in REG_RING_NOTIFY_BASE can prevent the
hardware from re-arming its MSI-X generator when the ring is
re-enabled. Clear the ring's bit in REG_RING_INT_CLEAR before setting
the enable bit, mirroring what ring_msix() already does at ISR entry.
Verified on a Maple Ridge 4C transit and two TB3 Titan Ridge endpoints
running NCCL all-reduce over tb-lo: pre-patch the chain wedges in
under 1 GB; post-patch a 192 GB run (3000 iterations of a 64 MiB
all-reduce) completes with mask/unmask counters balanced.
Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
---
drivers/thunderbolt/nhi.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index 2bb2e79ca..bba45ec36 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -389,10 +389,24 @@ static void __ring_interrupt_mask(struct tb_ring *ring, bool mask)
u32 val;
val = ioread32(ring->nhi->iobase + reg);
- if (mask)
+ if (mask) {
val &= ~BIT(bit);
- else
+ } else {
+ if (!(ring->nhi->quirks & QUIRK_AUTO_CLEAR_INT)) {
+ int cbit = ring_interrupt_index(ring) & 31;
+
+ if (ring->is_tx)
+ iowrite32(BIT(cbit),
+ ring->nhi->iobase +
+ REG_RING_INT_CLEAR);
+ else
+ iowrite32(BIT(cbit),
+ ring->nhi->iobase +
+ REG_RING_INT_CLEAR +
+ 4 * (ring->nhi->hop_count / 32));
+ }
val |= BIT(bit);
+ }
iowrite32(val, ring->nhi->iobase + reg);
}
@@ -423,8 +437,10 @@ void tb_ring_poll_complete(struct tb_ring *ring)
spin_lock_irqsave(&ring->nhi->lock, flags);
spin_lock(&ring->lock);
- if (ring->start_poll)
+ if (ring->running) {
__ring_interrupt_mask(ring, false);
+ (void)ioread32(ring->nhi->iobase + REG_RING_INTERRUPT_BASE);
+ }
spin_unlock(&ring->lock);
spin_unlock_irqrestore(&ring->nhi->lock, flags);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete()
2026-04-28 1:55 ` [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete() Benjamin Berman
@ 2026-04-28 7:33 ` Mika Westerberg
0 siblings, 0 replies; 6+ messages in thread
From: Mika Westerberg @ 2026-04-28 7:33 UTC (permalink / raw)
To: Benjamin Berman
Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, Andrew Lunn,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-usb, netdev, linux-kernel
Hi,
On Mon, Apr 27, 2026 at 06:55:20PM -0700, Benjamin Berman wrote:
> Under concurrent load on a single NHI with several rings simultaneously
> in NAPI poll (e.g. a Maple Ridge TB4 transit forwarding tbnet traffic
> between two peers), one ring's interrupt enable bit in
> REG_RING_INTERRUPT_BASE can stay cleared. MSI-X stops for that ring,
> NAPI is never rescheduled, but carrier is reported up and no driver
> event fires. The ring stays masked until thunderbolt_net is reloaded.
>
> tb_ring_poll_complete() gated the unmask on @start_poll:
>
> if (ring->start_poll)
> __ring_interrupt_mask(ring, false);
>
> while the ISR path masks unconditionally via __ring_interrupt(). In a
> window where @start_poll is observed as NULL by the unmask path while
> the paired mask persists, the ring is left permanently masked.
>
> Gate on @running instead and add an ioread32() barrier so the posted
> enable reaches the device before the spinlock is dropped.
>
> On NHIs without QUIRK_AUTO_CLEAR_INT a second issue compounds the
> first: stale pending status in REG_RING_NOTIFY_BASE can prevent the
> hardware from re-arming its MSI-X generator when the ring is
> re-enabled. Clear the ring's bit in REG_RING_INT_CLEAR before setting
> the enable bit, mirroring what ring_msix() already does at ISR entry.
>
> Verified on a Maple Ridge 4C transit and two TB3 Titan Ridge endpoints
> running NCCL all-reduce over tb-lo: pre-patch the chain wedges in
> under 1 GB; post-patch a 192 GB run (3000 iterations of a 64 MiB
> all-reduce) completes with mask/unmask counters balanced.
I think this makes sense.
I do have a few comments about the code itself. See below.
> Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
> Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> ---
> drivers/thunderbolt/nhi.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> index 2bb2e79ca..bba45ec36 100644
> --- a/drivers/thunderbolt/nhi.c
> +++ b/drivers/thunderbolt/nhi.c
> @@ -389,10 +389,24 @@ static void __ring_interrupt_mask(struct tb_ring *ring, bool mask)
> u32 val;
>
> val = ioread32(ring->nhi->iobase + reg);
> - if (mask)
> + if (mask) {
> val &= ~BIT(bit);
> - else
> + } else {
> + if (!(ring->nhi->quirks & QUIRK_AUTO_CLEAR_INT)) {
> + int cbit = ring_interrupt_index(ring) & 31;
> +
> + if (ring->is_tx)
> + iowrite32(BIT(cbit),
> + ring->nhi->iobase +
> + REG_RING_INT_CLEAR);
> + else
> + iowrite32(BIT(cbit),
> + ring->nhi->iobase +
> + REG_RING_INT_CLEAR +
> + 4 * (ring->nhi->hop_count / 32));
> + }
This should be a separate helper function, ring_interrupt_clear() or
so. We actually have a function with that name, but it clears with a
bit too big a hammer for this. So I suggest reworking it with the
above code and using it here and also in nhi_disable_interrupts().
> val |= BIT(bit);
> + }
> iowrite32(val, ring->nhi->iobase + reg);
> }
>
> @@ -423,8 +437,10 @@ void tb_ring_poll_complete(struct tb_ring *ring)
>
> spin_lock_irqsave(&ring->nhi->lock, flags);
> spin_lock(&ring->lock);
> - if (ring->start_poll)
> + if (ring->running) {
> __ring_interrupt_mask(ring, false);
> + (void)ioread32(ring->nhi->iobase + REG_RING_INTERRUPT_BASE);
Drop the (void) cast but add a comment that this is for posted write.
> + }
> spin_unlock(&ring->lock);
> spin_unlock_irqrestore(&ring->nhi->lock, flags);
> }
> --
> 2.43.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load
2026-04-28 1:55 [PATCH 0/2] thunderbolt: fix wedge under sustained tbnet load on AM4 and AM5 Benjamin Berman
2026-04-28 1:55 ` [PATCH 1/2] thunderbolt: drop start_poll guard in tb_ring_poll_complete() Benjamin Berman
@ 2026-04-28 1:55 ` Benjamin Berman
2026-04-28 7:42 ` Mika Westerberg
1 sibling, 1 reply; 6+ messages in thread
From: Benjamin Berman @ 2026-04-28 1:55 UTC (permalink / raw)
To: Andreas Noever, Mika Westerberg, Yehezkel Bernat
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
The default TBNET_RING_SIZE of 256 and the NAPI_POLL_WEIGHT of 64
implicit in netif_napi_add() are too small for host-to-host Thunderbolt
networking under sustained bulk traffic. Running NCCL all-reduce over
tb-lo on a three-node chain (two TB3 endpoints plus a TB4 Maple Ridge
transit) produces rx_missed_errors at ~1 % of rx_packets on the transit
and ~0.6 % on the endpoints, with rx_packets stalling against a peer's
continuing tx_packets.
Raise TBNET_RING_SIZE to 2048 (8x) and use netif_napi_add_weight() with
a per-NAPI weight of 256 so tbnet_poll() drains more frames per softirq
invocation. With matching sysctls (net.core.netdev_budget=1024,
net.core.netdev_budget_usecs=8000) rx_missed_errors stays below 0.005 %
over a 192 GB all-reduce workload on the same hardware.
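For reference, the matching sysctls above can be applied like this (values copied from this commit message; persisting them under /etc/sysctl.d/ is left to the distribution):

```shell
# Raise the softirq packet budget to match the larger NAPI weight.
sysctl -w net.core.netdev_budget=1024
# Allow up to 8 ms of packet processing per net_rx_action() run.
sysctl -w net.core.netdev_budget_usecs=8000
```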
Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
---
drivers/net/thunderbolt/main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/thunderbolt/main.c b/drivers/net/thunderbolt/main.c
index 7aae5d915..3a096f7c5 100644
--- a/drivers/net/thunderbolt/main.c
+++ b/drivers/net/thunderbolt/main.c
@@ -31,7 +31,7 @@
#define TBNET_LOGIN_TIMEOUT 500
#define TBNET_LOGOUT_TIMEOUT 1000
-#define TBNET_RING_SIZE 256
+#define TBNET_RING_SIZE 2048
#define TBNET_LOGIN_RETRIES 60
#define TBNET_LOGOUT_RETRIES 10
#define TBNET_E2E BIT(0)
@@ -1383,7 +1383,7 @@ static int tbnet_probe(struct tb_service *svc, const struct tb_service_id *id)
dev->features = dev->hw_features | NETIF_F_HIGHDMA;
dev->hard_header_len += sizeof(struct thunderbolt_ip_frame_header);
- netif_napi_add(dev, &net->napi, tbnet_poll);
+ netif_napi_add_weight(dev, &net->napi, tbnet_poll, 256);
/* MTU range: 68 - 65522 */
dev->min_mtu = ETH_MIN_MTU;
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load
2026-04-28 1:55 ` [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load Benjamin Berman
@ 2026-04-28 7:42 ` Mika Westerberg
2026-04-28 12:54 ` Andrew Lunn
0 siblings, 1 reply; 6+ messages in thread
From: Mika Westerberg @ 2026-04-28 7:42 UTC (permalink / raw)
To: Benjamin Berman
Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, Andrew Lunn,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-usb, netdev, linux-kernel
On Mon, Apr 27, 2026 at 06:55:21PM -0700, Benjamin Berman wrote:
> The default TBNET_RING_SIZE of 256 and the NAPI_POLL_WEIGHT of 64
> implicit in netif_napi_add() are too small for host-to-host Thunderbolt
> networking under sustained bulk traffic. Running NCCL all-reduce over
> tb-lo on a three-node chain (two TB3 endpoints plus a TB4 Maple Ridge
> transit) produces rx_missed_errors at ~1 % of rx_packets on the transit
> and ~0.6 % on the endpoints, with rx_packets stalling against a peer's
> continuing tx_packets.
>
> Raise TBNET_RING_SIZE to 2048 (8x) and use netif_napi_add_weight() with
> a per-NAPI weight of 256 so tbnet_poll() drains more frames per softirq
> invocation. With matching sysctls (net.core.netdev_budget=1024,
> net.core.netdev_budget_usecs=8000) rx_missed_errors stays below 0.005 %
> over a 192 GB all-reduce workload on the same hardware.
>
> Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
> Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
For ring size I don't have any objections. The current ring size 256 is
arbitrary and at the time seemed reasonable.
For the poll weight there is the comment in netdevice.h:
/* Default NAPI poll() weight
* Device drivers are strongly advised to not use bigger value
*/
#define NAPI_POLL_WEIGHT 64
But if you see improvement using 256 here I'm fine with that unless the
network folks advise otherwise.
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] net: thunderbolt: enlarge RX/TX ring and set NAPI weight for sustained load
2026-04-28 7:42 ` Mika Westerberg
@ 2026-04-28 12:54 ` Andrew Lunn
0 siblings, 0 replies; 6+ messages in thread
From: Andrew Lunn @ 2026-04-28 12:54 UTC (permalink / raw)
To: Mika Westerberg
Cc: Benjamin Berman, Andreas Noever, Mika Westerberg, Yehezkel Bernat,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-usb, netdev, linux-kernel
On Tue, Apr 28, 2026 at 09:42:53AM +0200, Mika Westerberg wrote:
> On Mon, Apr 27, 2026 at 06:55:21PM -0700, Benjamin Berman wrote:
> > The default TBNET_RING_SIZE of 256 and the NAPI_POLL_WEIGHT of 64
> > implicit in netif_napi_add() are too small for host-to-host Thunderbolt
> > networking under sustained bulk traffic. Running NCCL all-reduce over
> > tb-lo on a three-node chain (two TB3 endpoints plus a TB4 Maple Ridge
> > transit) produces rx_missed_errors at ~1 % of rx_packets on the transit
> > and ~0.6 % on the endpoints, with rx_packets stalling against a peer's
> > continuing tx_packets.
> >
> > Raise TBNET_RING_SIZE to 2048 (8x) and use netif_napi_add_weight() with
> > a per-NAPI weight of 256 so tbnet_poll() drains more frames per softirq
> > invocation. With matching sysctls (net.core.netdev_budget=1024,
> > net.core.netdev_budget_usecs=8000) rx_missed_errors stays below 0.005 %
> > over a 192 GB all-reduce workload on the same hardware.
> >
> > Generated-by: Claude Opus 4.7 <claude-opus-4-7@anthropic.com>
> > Tested-by: Benjamin Berman <benjamin.s.berman@gmail.com>
> > Signed-off-by: Benjamin Berman <benjamin.s.berman@gmail.com>
>
> For ring size I don't have any objections. The current ring size 256 is
> arbitrary and at the time seemed reasonable.
>
> For the poll weight there is the comment in netdevice.h:
>
> /* Default NAPI poll() weight
> * Device drivers are strongly advised to not use bigger value
> */
> #define NAPI_POLL_WEIGHT 64
>
> But if you see improvement using 256 here I'm fine with that unless the
> network folks advise otherwise.
I just did a quick sample of other drivers that change the NAPI
weight. Of the 10 I looked at, 9 reduced the weight; only one
increased it.
I would like the core netdev people to comment on this, before it is
accepted.
Questions which come to mind:
Why is the polling not happening frequently enough?
Is it frequently swapping between polling and interrupts?
Is there interrupt coalesce going on, and the coalesce time set too
high, so that by the time the interrupt fires the ring is full? Can
you play with ethtool -C?
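For anyone following up on that question, coalescing is inspected and tuned along these lines (tbnet may not implement the relevant ethtool ops, in which case the commands simply report that the operation is not supported; the rx-usecs value is a hypothetical example):

```shell
# Show the current interrupt coalescing parameters for the interface.
ethtool -c tb-lo
# Example: lower the RX coalesce time to 20 us to see whether the ring
# fills before the interrupt fires.
ethtool -C tb-lo rx-usecs 20
```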
Andrew
^ permalink raw reply [flat|nested] 6+ messages in thread