netdev.vger.kernel.org archive mirror
* [BUG] ixgbe: Detected Tx Unit Hang (XDP)
@ 2025-04-09 15:17 Marcus Wichelmann
  2025-04-10 14:30 ` Michal Kubiak
  0 siblings, 1 reply; 10+ messages in thread
From: Marcus Wichelmann @ 2025-04-09 15:17 UTC (permalink / raw)
  To: Tony Nguyen, Jay Vosburgh, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend
  Cc: intel-wired-lan, netdev, bpf, linux-kernel, sdn

Hi,

In a setup where I use native XDP to redirect packets to a bonding interface
that is backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
resets the NIC with the following kernel output:

  ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
    Tx Queue             <4>
    TDH, TDT             <17e>, <17e>
    next_to_use          <181>
    next_to_clean        <17e>
  tx_buffer_info[next_to_clean]
    time_stamp           <0>
    jiffies              <10025c380>
  ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
  ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
  ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
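
For reference, decoding the counters in this dump: next_to_use (0x181) is three
descriptors ahead of TDH/TDT (0x17e), i.e. the driver queued descriptors that
the hardware head never advanced past. A quick shell check of that arithmetic:

```shell
# Values taken from the hang dump above (hex). next_to_use is the driver's
# software producer index; TDH is the hardware head that stopped moving.
tdh=0x17e
next_to_use=0x181
echo "outstanding descriptors: $(( next_to_use - tdh ))"
# prints: outstanding descriptors: 3
```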

This only occurs in combination with a bonding interface and XDP, so I don't
know whether this is an issue with ixgbe or the bonding driver.
I first discovered this with Linux 6.8.0-57, but kernels 6.14.0 and 6.15.0-rc1
show the same issue.


I managed to reproduce this bug in a lab environment. Here are some details
about my setup and the steps to reproduce the bug:

NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
     Also reproduced on:
     - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
     - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Kernel: 6.15.0-rc1 (built from mainline)

  # ethtool -i ixgbe-x520-1
  driver: ixgbe
  version: 6.15.0-rc1
  firmware-version: 0x00012b2c, 1.3429.0
  expansion-rom-version: 
  bus-info: 0000:01:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
connected with each other using a DAC cable. Both ports are configured as
slaves of a bonding interface in balance-rr mode.
Neither the direct connection of both ports nor the round-robin bonding mode
is a requirement to reproduce the issue; this setup just makes the bug easier
to reproduce in an isolated environment. The issue is also visible with a
regular 802.3ad link aggregation with a switch on the other side.

  # modprobe bonding
  # ip link set dev ixgbe-x520-1 down
  # ip link set dev ixgbe-x520-2 down
  # ip link add bond0 type bond mode balance-rr
  # ip link set dev ixgbe-x520-1 master bond0
  # ip link set dev ixgbe-x520-2 master bond0
  # ip link set dev ixgbe-x520-1 up
  # ip link set dev ixgbe-x520-2 up
  # ip link set dev bond0 up
        
  # cat /proc/net/bonding/bond0
  Ethernet Channel Bonding Driver: v6.15.0-rc1

  Bonding Mode: load balancing (round-robin)
  MII Status: up
  MII Polling Interval (ms): 0
  Up Delay (ms): 0
  Down Delay (ms): 0
  Peer Notification Delay (ms): 0

  Slave Interface: ixgbe-x520-1
  MII Status: up
  Speed: 10000 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: 6c:b3:11:08:5c:3c
  Slave queue ID: 0

  Slave Interface: ixgbe-x520-2
  MII Status: up
  Speed: 10000 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: 6c:b3:11:08:5c:3e
  Slave queue ID: 0

  # ethtool -l ixgbe-x520-1
  Channel parameters for ixgbe-x520-1:
  Pre-set maximums:
  RX:             n/a
  TX:             n/a
  Other:          1
  Combined:       63
  Current hardware settings:
  RX:             n/a
  TX:             n/a
  Other:          1
  Combined:       63
  (same for ixgbe-x520-2)

In the following, the xdp-tools from https://github.com/xdp-project/xdp-tools/
are used.

Enable XDP on the bonding interface and make sure all received packets are dropped:
  # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0

Redirect a batch of packets to the bonding interface:
  # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0>
    --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
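
The <mac of bond0> placeholder can be read from sysfs. A small helper sketch,
assuming the interface name from the setup above:

```shell
# Print the MAC address of an interface (to pass as --dst-mac). Prints
# nothing if the interface does not exist.
get_mac() {
    cat "/sys/class/net/$1/address" 2>/dev/null
}
get_mac bond0
```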

Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
(see above) will show up in the kernel log.

The high packet count and thread count (--threads 16) are not required to
trigger the issue but greatly increase the probability.


Do you have any ideas what may be causing this issue or what I can do to
diagnose it further?

Please let me know if I should provide any more information.


Thanks!
Marcus


Thread overview: 10+ messages
2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
2025-04-10 14:30 ` Michal Kubiak
2025-04-10 14:54   ` Marcus Wichelmann
2025-04-11  8:14     ` Michal Kubiak
2025-04-17 14:47       ` Maciej Fijalkowski
2025-04-23 14:20         ` Marcus Wichelmann
2025-04-23 18:39           ` Maciej Fijalkowski
2025-04-24 10:19             ` Tobias Böhm
2025-05-05 15:23               ` Tobias Böhm
2025-05-08 19:25                 ` Maciej Fijalkowski
