netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michal Kubiak <michal.kubiak@intel.com>
To: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>,
	Jay Vosburgh <jv@jvosburgh.net>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	"John Fastabend" <john.fastabend@gmail.com>,
	<intel-wired-lan@lists.osuosl.org>, <netdev@vger.kernel.org>,
	<bpf@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<sdn@hetzner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Date: Thu, 10 Apr 2025 16:30:53 +0200	[thread overview]
Message-ID: <Z/fWHYETBYQuCno5@localhost.localdomain> (raw)
In-Reply-To: <d33f0ab4-4dc4-49cd-bbd0-055f58dd6758@hetzner-cloud.de>

On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> Hi,
> 
> in a setup where I use native XDP to redirect packets to a bonding interface
> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> resets the NIC with the following kernel output:
> 
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>     Tx Queue             <4>
>     TDH, TDT             <17e>, <17e>
>     next_to_use          <181>
>     next_to_clean        <17e>
>   tx_buffer_info[next_to_clean]
>     time_stamp           <0>
>     jiffies              <10025c380>
>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> 
> This only occurs in combination with a bonding interface and XDP, so I don't
> know if this is an issue with ixgbe or the bonding driver.
> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> show the same issue.
> 
> 
> I managed to reproduce this bug in a lab environment. Here are some details
> about my setup and the steps to reproduce the bug:
> 
> NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> 
> CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
>      Also reproduced on:
>      - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
>      - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> 
> Kernel: 6.15.0-rc1 (built from mainline)
> 
>   # ethtool -i ixgbe-x520-1
>   driver: ixgbe
>   version: 6.15.0-rc1
>   firmware-version: 0x00012b2c, 1.3429.0
>   expansion-rom-version: 
>   bus-info: 0000:01:00.0
>   supports-statistics: yes
>   supports-test: yes
>   supports-eeprom-access: yes
>   supports-register-dump: yes
>   supports-priv-flags: yes
> 
> The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
> connected with each other using a DAC cable. Both ports are configured to be
> slaves of a bonding with mode balance-rr.
> Neither the direct connection of  both ports nor the round-robin bonding mode
> are a requirement to reproduce the issue. This setup just allows it to be easier
> reproduced in an isolated environment. The issue is also visible with a regular
> 802.3ad link aggregation with a switch on the other side.
> 
>   # modprobe bonding
>   # ip link set dev ixgbe-x520-1 down
>   # ip link set dev ixgbe-x520-2 down
>   # ip link add bond0 type bond mode balance-rr
>   # ip link set dev ixgbe-x520-1 master bond0
>   # ip link set dev ixgbe-x520-2 master bond0
>   # ip link set dev ixgbe-x520-1 up
>   # ip link set dev ixgbe-x520-2 up
>   # ip link set dev bond0 up
>         
>   # cat /proc/net/bonding/bond0
>   Ethernet Channel Bonding Driver: v6.15.0-rc1
> 
>   Bonding Mode: load balancing (round-robin)
>   MII Status: up
>   MII Polling Interval (ms): 0
>   Up Delay (ms): 0
>   Down Delay (ms): 0
>   Peer Notification Delay (ms): 0
> 
>   Slave Interface: ixgbe-x520-1
>   MII Status: up
>   Speed: 10000 Mbps
>   Duplex: full
>   Link Failure Count: 0
>   Permanent HW addr: 6c:b3:11:08:5c:3c
>   Slave queue ID: 0
> 
>   Slave Interface: ixgbe-x520-2
>   MII Status: up
>   Speed: 10000 Mbps
>   Duplex: full
>   Link Failure Count: 0
>   Permanent HW addr: 6c:b3:11:08:5c:3e
>   Slave queue ID: 0
> 
>   # ethtool -l ixgbe-x520-1
>   Channel parameters for ixgbe-x520-1:
>   Pre-set maximums:
>   RX:             n/a
>   TX:             n/a
>   Other:          1
>   Combined:       63
>   Current hardware settings:
>   RX:             n/a
>   TX:             n/a
>   Other:          1
>   Combined:       63
>   (same for ixgbe-x520-2)
> 
> In the following the xdp-tools from https://github.com/xdp-project/xdp-tools/
> are used.
> 
> Enable XDP on the bonding and make sure all received packets will be dropped:
>   # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0
> 
> Redirect a batch of packets to the bonding interface:
>   # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0>
>     --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
> 
> Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
> (see above) will show up in the kernel log.
> 
> The high number of packets and thread count (--threads 16) is not required to
> trigger the issue but greatly improves the probability.
> 
> 
> Do you have any ideas what may be causing this issue or what I can do to
> diagnose this further?
> 
> Please let me know when I should provide any more information.
> 
> 
> Thanks!
> Marcus
> 

Hi Marcus,

Thank you for reporting this issue!
I have just successfully reproduced the problem on our lab machine. What
is interesting is that I do not seem to have to use a bonding interface
to get the "Tx timeout" that causes the adapter to reset.

I will try to debug the problem more closely and let you know of any
updates.

Thanks,
Michal

  reply	other threads:[~2025-04-10 14:31 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
2025-04-10 14:30 ` Michal Kubiak [this message]
2025-04-10 14:54   ` Marcus Wichelmann
2025-04-11  8:14     ` Michal Kubiak
2025-04-17 14:47       ` Maciej Fijalkowski
2025-04-23 14:20         ` Marcus Wichelmann
2025-04-23 18:39           ` Maciej Fijalkowski
2025-04-24 10:19             ` Tobias Böhm
2025-05-05 15:23               ` Tobias Böhm
2025-05-08 19:25                 ` Maciej Fijalkowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z/fWHYETBYQuCno5@localhost.localdomain \
    --to=michal.kubiak@intel.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=anthony.l.nguyen@intel.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=john.fastabend@gmail.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marcus.wichelmann@hetzner-cloud.de \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=sdn@hetzner-cloud.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).