netdev.vger.kernel.org archive mirror
* [BUG] ixgbe: Detected Tx Unit Hang (XDP)
@ 2025-04-09 15:17 Marcus Wichelmann
  2025-04-10 14:30 ` Michal Kubiak
From: Marcus Wichelmann @ 2025-04-09 15:17 UTC (permalink / raw)
  To: Tony Nguyen, Jay Vosburgh, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend
  Cc: intel-wired-lan, netdev, bpf, linux-kernel, sdn

Hi,

in a setup where I use native XDP to redirect packets to a bonding interface
that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
resets the NIC with the following kernel output:

  ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
    Tx Queue             <4>
    TDH, TDT             <17e>, <17e>
    next_to_use          <181>
    next_to_clean        <17e>
  tx_buffer_info[next_to_clean]
    time_stamp           <0>
    jiffies              <10025c380>
  ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
  ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
  ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter

This only occurs in combination with a bonding interface and XDP, so I don't
know if this is an issue with ixgbe or the bonding driver.
I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
show the same issue.


I managed to reproduce this bug in a lab environment. Here are some details
about my setup and the steps to reproduce the bug:

NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
     Also reproduced on:
     - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
     - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Kernel: 6.15.0-rc1 (built from mainline)

  # ethtool -i ixgbe-x520-1
  driver: ixgbe
  version: 6.15.0-rc1
  firmware-version: 0x00012b2c, 1.3429.0
  expansion-rom-version: 
  bus-info: 0000:01:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
connected to each other using a DAC cable. Both ports are configured as
slaves of a bonding interface with mode balance-rr.
Neither the direct connection of both ports nor the round-robin bonding mode
is a requirement for reproducing the issue; this setup just makes the issue
easier to reproduce in an isolated environment. The issue is also visible with
a regular 802.3ad link aggregation with a switch on the other side.

  # modprobe bonding
  # ip link set dev ixgbe-x520-1 down
  # ip link set dev ixgbe-x520-2 down
  # ip link add bond0 type bond mode balance-rr
  # ip link set dev ixgbe-x520-1 master bond0
  # ip link set dev ixgbe-x520-2 master bond0
  # ip link set dev ixgbe-x520-1 up
  # ip link set dev ixgbe-x520-2 up
  # ip link set dev bond0 up
        
  # cat /proc/net/bonding/bond0
  Ethernet Channel Bonding Driver: v6.15.0-rc1

  Bonding Mode: load balancing (round-robin)
  MII Status: up
  MII Polling Interval (ms): 0
  Up Delay (ms): 0
  Down Delay (ms): 0
  Peer Notification Delay (ms): 0

  Slave Interface: ixgbe-x520-1
  MII Status: up
  Speed: 10000 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: 6c:b3:11:08:5c:3c
  Slave queue ID: 0

  Slave Interface: ixgbe-x520-2
  MII Status: up
  Speed: 10000 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: 6c:b3:11:08:5c:3e
  Slave queue ID: 0

  # ethtool -l ixgbe-x520-1
  Channel parameters for ixgbe-x520-1:
  Pre-set maximums:
  RX:             n/a
  TX:             n/a
  Other:          1
  Combined:       63
  Current hardware settings:
  RX:             n/a
  TX:             n/a
  Other:          1
  Combined:       63
  (same for ixgbe-x520-2)

In the following, the xdp-tools from https://github.com/xdp-project/xdp-tools/
are used.

Enable XDP on the bonding interface and make sure all received packets are dropped:
  # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0

Redirect a batch of packets to the bonding interface:
  # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0>
    --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0

Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
(see above) will show up in the kernel log.

The high number of packets and the thread count (--threads 16) are not required
to trigger the issue but greatly increase the probability.
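
For reference, one way to watch for these resets while the traffic generator
is running is to follow the kernel log, e.g.:

  # dmesg --follow | grep -E "Tx Unit Hang|Reset adapter"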


Do you have any ideas what may be causing this issue or what I can do to
diagnose this further?

Please let me know if I should provide any more information.


Thanks!
Marcus


* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
@ 2025-04-10 14:30 ` Michal Kubiak
  2025-04-10 14:54   ` Marcus Wichelmann
From: Michal Kubiak @ 2025-04-10 14:30 UTC (permalink / raw)
  To: Marcus Wichelmann
  Cc: Tony Nguyen, Jay Vosburgh, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, intel-wired-lan, netdev, bpf, linux-kernel, sdn

On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> Hi,
> 
> in a setup where I use native XDP to redirect packets to a bonding interface
> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> resets the NIC with the following kernel output:
> 
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>     Tx Queue             <4>
>     TDH, TDT             <17e>, <17e>
>     next_to_use          <181>
>     next_to_clean        <17e>
>   tx_buffer_info[next_to_clean]
>     time_stamp           <0>
>     jiffies              <10025c380>
>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> 
> This only occurs in combination with a bonding interface and XDP, so I don't
> know if this is an issue with ixgbe or the bonding driver.
> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> show the same issue.
> 
> 
> I managed to reproduce this bug in a lab environment. Here are some details
> about my setup and the steps to reproduce the bug:
> 
> [...]
> 
> Do you have any ideas what may be causing this issue or what I can do to
> diagnose this further?
> 
> Please let me know if I should provide any more information.
> 
> 
> Thanks!
> Marcus
> 

Hi Marcus,

Thank you for reporting this issue!
I have just successfully reproduced the problem on our lab machine. What
is interesting is that I do not seem to have to use a bonding interface
to get the "Tx timeout" that causes the adapter to reset.

I will try to debug the problem more closely and let you know of any
updates.

Thanks,
Michal


* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-10 14:30 ` Michal Kubiak
@ 2025-04-10 14:54   ` Marcus Wichelmann
  2025-04-11  8:14     ` Michal Kubiak
From: Marcus Wichelmann @ 2025-04-10 14:54 UTC (permalink / raw)
  To: Michal Kubiak
  Cc: Tony Nguyen, Jay Vosburgh, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, intel-wired-lan, netdev, bpf, linux-kernel, sdn

On 10.04.25 at 16:30, Michal Kubiak wrote:
> On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
>> Hi,
>>
>> [...]
>>
>> Thanks!
>> Marcus
>>
> 
> Hi Marcus,

Hi Michal,

thank you for looking into it. And not even 24 hours after my report, I'm
very impressed! ;)

> I have just successfully reproduced the problem on our lab machine. What
> is interesting is that I do not seem to have to use a bonding interface
> to get the "Tx timeout" that causes the adapter to reset.

Interesting. I just tried again but had no luck yet reproducing it
without a bonding interface. May I ask what your setup looks like?

> I will try to debug the problem more closely and let you know of any
> updates.
> 
> Thanks,
> Michal

Great!

Marcus



* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-10 14:54   ` Marcus Wichelmann
@ 2025-04-11  8:14     ` Michal Kubiak
  2025-04-17 14:47       ` Maciej Fijalkowski
From: Michal Kubiak @ 2025-04-11  8:14 UTC (permalink / raw)
  To: Marcus Wichelmann
  Cc: Tony Nguyen, Jay Vosburgh, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, intel-wired-lan, netdev, bpf, linux-kernel, sdn

On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> On 10.04.25 at 16:30, Michal Kubiak wrote:
> > [...]
> 
> Hi Michal,
> 
> thank you for looking into it. And not even 24 hours after my report, I'm
> very impressed! ;)
> 
> > I have just successfully reproduced the problem on our lab machine. What
> > is interesting is that I do not seem to have to use a bonding interface
> > to get the "Tx timeout" that causes the adapter to reset.
> 
> Interesting. I just tried again but had no luck yet reproducing it
> without a bonding interface. May I ask what your setup looks like?
> 
> > I will try to debug the problem more closely and let you know of any
> > updates.
> > 
> > Thanks,
> > Michal
> 
> Great!
> 
> Marcus
>

Hi Marcus,

> thank you for looking into it. And not even 24 hours after my report, I'm
> very impressed! ;)

Thanks! :-)

> Interesting. I just tried again but had no luck yet reproducing it
> without a bonding interface. May I ask what your setup looks like?

For now, I've just grabbed the first available system with the HW
controlled by the "ixgbe" driver. In my case it was:

  Ethernet controller: Intel Corporation Ethernet Controller X550

Also, for my first attempt, I didn't use the upstream kernel - I just tried
the kernel installed on that system. It was the Fedora kernel:

  6.12.8-200.fc41.x86_64


I think that may be the "beauty" of timing issues - sometimes you can change
just one piece in your system and get a completely different reproduction rate.
Anyway, the higher the repro probability, the easier it is to debug
the timing problem. :-)

Thanks,
Michal



* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-11  8:14     ` Michal Kubiak
@ 2025-04-17 14:47       ` Maciej Fijalkowski
  2025-04-23 14:20         ` Marcus Wichelmann
From: Maciej Fijalkowski @ 2025-04-17 14:47 UTC (permalink / raw)
  To: Michal Kubiak
  Cc: Marcus Wichelmann, Tony Nguyen, Jay Vosburgh, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, intel-wired-lan, netdev,
	bpf, linux-kernel, sdn

On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> [...]
> 
> I think that may be the "beauty" of timing issues - sometimes you can change
> just one piece in your system and get a completely different reproduction rate.
> Anyway, the higher the repro probability, the easier it is to debug
> the timing problem. :-)

Hi Marcus, to break the silence could you try to apply the diff below on
your side? We see several issues around XDP queues in ixgbe, but before we
proceed, let's try this small change on your side.

An additional question: do you have pause frames enabled on your setup?

From 6bf437ee12b4ef927a9015b568654cf7d8cabab2 Mon Sep 17 00:00:00 2001
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Thu, 17 Apr 2025 14:42:45 +0000
Subject: [PATCH] ixgbe: don't check hangs on XDP queues

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 21 ++++++-------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 467f81239e12..06c62ec445b5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1263,10 +1263,13 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 				   total_bytes);
 	adapter->tx_ipsec += total_ipsec;
 
+	if (ring_is_xdp(tx_ring))
+		return !!budget;
+
 	if (check_for_tx_hang(tx_ring) && ixgbe_check_tx_hang(tx_ring)) {
 		/* schedule immediate reset if we believe we hung */
 		struct ixgbe_hw *hw = &adapter->hw;
-		e_err(drv, "Detected Tx Unit Hang %s\n"
+		e_err(drv, "Detected Tx Unit Hang\n"
 			"  Tx Queue             <%d>\n"
 			"  TDH, TDT             <%x>, <%x>\n"
 			"  next_to_use          <%x>\n"
@@ -1274,16 +1277,14 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 			"tx_buffer_info[next_to_clean]\n"
 			"  time_stamp           <%lx>\n"
 			"  jiffies              <%lx>\n",
-			ring_is_xdp(tx_ring) ? "(XDP)" : "",
 			tx_ring->queue_index,
 			IXGBE_READ_REG(hw, IXGBE_TDH(tx_ring->reg_idx)),
 			IXGBE_READ_REG(hw, IXGBE_TDT(tx_ring->reg_idx)),
 			tx_ring->next_to_use, i,
 			tx_ring->tx_buffer_info[i].time_stamp, jiffies);
 
-		if (!ring_is_xdp(tx_ring))
-			netif_stop_subqueue(tx_ring->netdev,
-					    tx_ring->queue_index);
+		netif_stop_subqueue(tx_ring->netdev,
+				    tx_ring->queue_index);
 
 		e_info(probe,
 		       "tx hang %d detected on queue %d, resetting adapter\n",
@@ -1296,9 +1297,6 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		return true;
 	}
 
-	if (ring_is_xdp(tx_ring))
-		return !!budget;
-
 #define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
 	txq = netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
 	if (!__netif_txq_completed_wake(txq, total_packets, total_bytes,
@@ -8011,13 +8009,6 @@ static bool ixgbe_ring_tx_pending(struct ixgbe_adapter *adapter)
 			return true;
 	}
 
-	for (i = 0; i < adapter->num_xdp_queues; i++) {
-		struct ixgbe_ring *ring = adapter->xdp_ring[i];
-
-		if (ring->next_to_use != ring->next_to_clean)
-			return true;
-	}
-
 	return false;
 }
 
-- 
2.43.0




* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-17 14:47       ` Maciej Fijalkowski
@ 2025-04-23 14:20         ` Marcus Wichelmann
  2025-04-23 18:39           ` Maciej Fijalkowski
From: Marcus Wichelmann @ 2025-04-23 14:20 UTC (permalink / raw)
  To: Maciej Fijalkowski, Michal Kubiak
  Cc: Tony Nguyen, Jay Vosburgh, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, intel-wired-lan, netdev, bpf, linux-kernel, sdn

On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
> On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
>> [...]
> 
> Hi Marcus, to break the silence could you try to apply the diff below on
> your side?

Hi, thank you for the patch. We've tried it and with your changes we can no
longer trigger the error and the NIC is no longer being reset.

> We see several issues around XDP queues in ixgbe, but before we
> proceed, let's try this small change on your side.

How confident are you that this patch is sufficient to make things stable enough
for production use? Was it just the Tx hang detection that was misbehaving for
the XDP case, or is there an underlying issue with the XDP queues that is not
solved by disabling the detection for it?

With our current setup we cannot accurately verify that we have no packet loss
or stuck queues. We can do additional tests to check that.
 
> An additional question: do you have pause frames enabled on your setup?

Pause frames were enabled, but we can also reproduce it after disabling them,
without your patch.
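
For reference, pause frame state can be checked and toggled with ethtool,
e.g. (and the same for ixgbe-x520-2):

  # ethtool -a ixgbe-x520-1
  # ethtool -A ixgbe-x520-1 autoneg off rx off tx off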

Thanks!
Marcus


* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-23 14:20         ` Marcus Wichelmann
@ 2025-04-23 18:39           ` Maciej Fijalkowski
  2025-04-24 10:19             ` Tobias Böhm
From: Maciej Fijalkowski @ 2025-04-23 18:39 UTC (permalink / raw)
  To: Marcus Wichelmann
  Cc: Michal Kubiak, Tony Nguyen, Jay Vosburgh, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, intel-wired-lan, netdev,
	bpf, linux-kernel, sdn

On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
> > [...]
> > 
> > Hi Marcus, to break the silence could you try to apply the diff below on
> > your side?
> 
> Hi, thank you for the patch. We've tried it and with your changes we can no
> longer trigger the error and the NIC is no longer being reset.
> 
> > We see several issues around XDP queues in ixgbe, but before we
> > proceed, let's try this small change on your side.
> 
> How confident are you that this patch is sufficient to make things stable enough
> for production use? Was it just the Tx hang detection that was misbehaving for
> the XDP case, or is there an underlying issue with the XDP queues that is not
> solved by disabling the detection for it?

I believe the correct way to approach this is to move the Tx hang
detection into ixgbe_tx_timeout(), as that is the place where this logic
belongs. By doing so, I suppose we would kill two birds with one stone:
the mentioned ndo is called by the netdev watchdog, which is not invoked
for XDP Tx queues.
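
A rough sketch of that idea, reusing helpers that already exist in
ixgbe_main.c (an illustration of the approach only, not the actual fix):

  /* Sketch: run the hang check from the watchdog-driven ndo_tx_timeout
   * callback instead of from the Tx descriptor cleaning path. */
  static void ixgbe_tx_timeout(struct net_device *netdev, unsigned int txqueue)
  {
          struct ixgbe_adapter *adapter = netdev_priv(netdev);

          /* The netdev watchdog only tracks queues exposed to the stack,
           * so XDP Tx rings would never reach this callback. */
          if (ixgbe_check_tx_hang(adapter->tx_ring[txqueue]))
                  ixgbe_tx_timeout_reset(adapter);
  }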

> 
> With our current setup we cannot accurately verify that we have no packet loss
> or stuck queues. We can do additional tests to check that.
>  
> > An additional question: do you have pause frames enabled on your setup?
> 
> Pause frames were enabled, but we can also reproduce it after disabling them,
> without your patch.

Please give your setup a go with pause frames enabled and the patch that I
shared previously applied, and let us see the results. As said above, I do
not think it is correct to check for hung queues in the Tx descriptor
cleaning routine. That is the job of the ndo_tx_timeout callback.

> 
> Thanks!

Thanks for the feedback and testing. I'll provide a proper fix tomorrow and CC
you so you can take it for a spin.

> Marcus


* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-23 18:39           ` Maciej Fijalkowski
@ 2025-04-24 10:19             ` Tobias Böhm
  2025-05-05 15:23               ` Tobias Böhm
From: Tobias Böhm @ 2025-04-24 10:19 UTC (permalink / raw)
  To: Maciej Fijalkowski, Marcus Wichelmann
  Cc: Michal Kubiak, Tony Nguyen, Jay Vosburgh, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, intel-wired-lan, netdev,
	bpf, linux-kernel, sdn



On 23.04.25 at 20:39, Maciej Fijalkowski wrote:
> [...]

Hi Maciej,

I'm a colleague of Marcus and involved in the testing as well.
>>> An additional question: do you have pause frames enabled on your setup?
>>
>> Pause frames were enabled, but we can also reproduce it after disabling them,
>> without your patch.
> 
> Please give your setup a go with pause frames enabled and the patch that I
> shared previously applied, and let us see the results. As said above, I do
> not think it is correct to check for hung queues in the Tx descriptor
> cleaning routine. That is the job of the ndo_tx_timeout callback.

We have tested with pause frames enabled and the patch applied, and we can no
longer trigger the error in our lab setup.

>>
>> Thanks!
> 
> Thanks for the feedback and testing. I'll provide a proper fix tomorrow and CC
> you so you can take it for a spin.
> 

That sounds great. We'd be happy to test with the proper fix in our 
original setup.

Thanks,
Tobias



* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-04-24 10:19             ` Tobias Böhm
@ 2025-05-05 15:23               ` Tobias Böhm
  2025-05-08 19:25                 ` Maciej Fijalkowski
From: Tobias Böhm @ 2025-05-05 15:23 UTC (permalink / raw)
  To: Maciej Fijalkowski, Marcus Wichelmann
  Cc: Michal Kubiak, Tony Nguyen, Jay Vosburgh, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, intel-wired-lan, netdev,
	bpf, linux-kernel, sdn

On 24.04.25 at 12:19, Tobias Böhm wrote:
> [...]

Hi,

During further testing with this patch applied, we noticed new warnings
showing up. We've also tested with the newly sent patch ("[PATCH
iwl-net] ixgbe: fix ndo_xdp_xmit() workloads") and see the same warnings.

I'm sending this observation to this thread because I'm not sure if it 
is related to those patches or if it was already present but hidden by 
the resets of the original issue reported by Marcus.

After processing test traffic (~10 million packets, as described in Marcus'
reproducer setup) and idling for a minute, the following warnings keep
being logged as long as the NIC idles:

  page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
  page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
  page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 120 sec
  page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 120 sec
  page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 181 sec
  page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 181 sec
  page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
  page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec

Just sending a single packet makes the warnings stop being logged.

After sending heavy test traffic again, new warnings start to be logged
after a minute of idling:

  page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
  page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
  page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
  page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec

Detaching the XDP program stops the warnings as well.
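
Detaching here means removing the XDP program from bond0 again; the
equivalent iproute2 command would be:

  # ip link set dev bond0 xdp off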

As before, pause frames were enabled.

Just like with the original issue, we were not always able to reproduce
those warnings. With more traffic, the chances of triggering them seem
higher.

Please let me know if I should provide any further information.

Thanks,
Tobias


* Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
  2025-05-05 15:23               ` Tobias Böhm
@ 2025-05-08 19:25                 ` Maciej Fijalkowski
From: Maciej Fijalkowski @ 2025-05-08 19:25 UTC (permalink / raw)
  To: Tobias Böhm
  Cc: Marcus Wichelmann, Michal Kubiak, Tony Nguyen, Jay Vosburgh,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, intel-wired-lan, netdev,
	bpf, linux-kernel, sdn

On Mon, May 05, 2025 at 05:23:02PM +0200, Tobias Böhm wrote:
> [...]
> 
> Just sending a single packet makes the warnings stop being logged.
> 
> After sending heavy test traffic again, new warnings start to be logged after
> a minute of idling:
> 
>   page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
>   page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
>   page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
>   page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec
> 
> Detaching the XDP program stops the warnings as well.
> 
> As before, pause frames were enabled.
> 
> Just like with the original issue, we were not always able to reproduce
> those warnings. With more traffic, the chances of triggering them seem higher.
> 
> Please let me know if I should provide any further information.

I can't reproduce this on my system, but FWIW these are coming from the page
pool created by xdp-trafficgen; my bet is that the ixgbe Tx cleaning routine
misses two entries for some reason.

What are your ring sizes? If you insist, I can provide a patch that optimizes
the Tx cleaning processing, and we can see whether it silences the warnings
on your side.
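
For reference, the current and maximum ring sizes can be read with ethtool,
e.g. for the interfaces from the original report:

  # ethtool -g ixgbe-x520-1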

> 
> Thanks,
> Tobias

