Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Michal Kubiak <michal.kubiak@intel.com>
To: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>,
	Jay Vosburgh <jv@jvosburgh.net>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	"John Fastabend" <john.fastabend@gmail.com>,
	<intel-wired-lan@lists.osuosl.org>, <netdev@vger.kernel.org>,
	<bpf@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<sdn@hetzner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Date: Thu, 10 Apr 2025 16:30:53 +0200	[thread overview]
Message-ID: <Z/fWHYETBYQuCno5@localhost.localdomain> (raw)
In-Reply-To: <d33f0ab4-4dc4-49cd-bbd0-055f58dd6758@hetzner-cloud.de>

On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> Hi,
> 
> in a setup where I use native XDP to redirect packets to a bonding interface
> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> resets the NIC with the following kernel output:
> 
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>     Tx Queue             <4>
>     TDH, TDT             <17e>, <17e>
>     next_to_use          <181>
>     next_to_clean        <17e>
>   tx_buffer_info[next_to_clean]
>     time_stamp           <0>
>     jiffies              <10025c380>
>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> 
> This only occurs in combination with a bonding interface and XDP, so I don't
> know if this is an issue with ixgbe or the bonding driver.
> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> show the same issue.
> 
> 
> I managed to reproduce this bug in a lab environment. Here are some details
> about my setup and the steps to reproduce the bug:
> 
> NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> 
> CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
>      Also reproduced on:
>      - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
>      - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> 
> Kernel: 6.15.0-rc1 (built from mainline)
> 
>   # ethtool -i ixgbe-x520-1
>   driver: ixgbe
>   version: 6.15.0-rc1
>   firmware-version: 0x00012b2c, 1.3429.0
>   expansion-rom-version: 
>   bus-info: 0000:01:00.0
>   supports-statistics: yes
>   supports-test: yes
>   supports-eeprom-access: yes
>   supports-register-dump: yes
>   supports-priv-flags: yes
> 
> The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
> connected with each other using a DAC cable. Both ports are configured to be
> slaves of a bonding with mode balance-rr.
> Neither the direct connection of  both ports nor the round-robin bonding mode
> are a requirement to reproduce the issue. This setup just allows it to be easier
> reproduced in an isolated environment. The issue is also visible with a regular
> 802.3ad link aggregation with a switch on the other side.
> 
>   # modprobe bonding
>   # ip link set dev ixgbe-x520-1 down
>   # ip link set dev ixgbe-x520-2 down
>   # ip link add bond0 type bond mode balance-rr
>   # ip link set dev ixgbe-x520-1 master bond0
>   # ip link set dev ixgbe-x520-2 master bond0
>   # ip link set dev ixgbe-x520-1 up
>   # ip link set dev ixgbe-x520-2 up
>   # ip link set dev bond0 up
>         
>   # cat /proc/net/bonding/bond0
>   Ethernet Channel Bonding Driver: v6.15.0-rc1
> 
>   Bonding Mode: load balancing (round-robin)
>   MII Status: up
>   MII Polling Interval (ms): 0
>   Up Delay (ms): 0
>   Down Delay (ms): 0
>   Peer Notification Delay (ms): 0
> 
>   Slave Interface: ixgbe-x520-1
>   MII Status: up
>   Speed: 10000 Mbps
>   Duplex: full
>   Link Failure Count: 0
>   Permanent HW addr: 6c:b3:11:08:5c:3c
>   Slave queue ID: 0
> 
>   Slave Interface: ixgbe-x520-2
>   MII Status: up
>   Speed: 10000 Mbps
>   Duplex: full
>   Link Failure Count: 0
>   Permanent HW addr: 6c:b3:11:08:5c:3e
>   Slave queue ID: 0
> 
>   # ethtool -l ixgbe-x520-1
>   Channel parameters for ixgbe-x520-1:
>   Pre-set maximums:
>   RX:             n/a
>   TX:             n/a
>   Other:          1
>   Combined:       63
>   Current hardware settings:
>   RX:             n/a
>   TX:             n/a
>   Other:          1
>   Combined:       63
>   (same for ixgbe-x520-2)
> 
> In the following the xdp-tools from https://github.com/xdp-project/xdp-tools/
> are used.
> 
> Enable XDP on the bonding and make sure all received packets will be dropped:
>   # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0
> 
> Redirect a batch of packets to the bonding interface:
>   # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0>
>     --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
> 
> Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
> (see above) will show up in the kernel log.
> 
> The high number of packets and thread count (--threads 16) is not required to
> trigger the issue but greatly improves the probability.
> 
> 
> Do you have any ideas what may be causing this issue or what I can do to
> diagnose this further?
> 
> Please let me know when I should provide any more information.
> 
> 
> Thanks!
> Marcus
> 

Hi Marcus,

Thank you for reporting this issue!
I have just successfully reproduced the problem on our lab machine. What
is interesting is that I do not seem to have to use a bonding interface
to get the "Tx timeout" that causes the adapter to reset.

I will try to debug the problem more closely and let you know of any
updates.

Thanks,
Michal

WARNING: multiple messages have this Message-ID (diff)

From: Michal Kubiak <michal.kubiak@intel.com>
To: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>,
	Jay Vosburgh <jv@jvosburgh.net>,
	 Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	"John Fastabend" <john.fastabend@gmail.com>,
	<intel-wired-lan@lists.osuosl.org>, <netdev@vger.kernel.org>,
	<bpf@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<sdn@hetzner-cloud.de>
Subject: Re: [Intel-wired-lan] [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Date: Thu, 10 Apr 2025 16:30:53 +0200	[thread overview]
Message-ID: <Z/fWHYETBYQuCno5@localhost.localdomain> (raw)
In-Reply-To: <d33f0ab4-4dc4-49cd-bbd0-055f58dd6758@hetzner-cloud.de>

On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> Hi,
> 
> in a setup where I use native XDP to redirect packets to a bonding interface
> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> resets the NIC with the following kernel output:
> 
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>     Tx Queue             <4>
>     TDH, TDT             <17e>, <17e>
>     next_to_use          <181>
>     next_to_clean        <17e>
>   tx_buffer_info[next_to_clean]
>     time_stamp           <0>
>     jiffies              <10025c380>
>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> 
> This only occurs in combination with a bonding interface and XDP, so I don't
> know if this is an issue with ixgbe or the bonding driver.
> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> show the same issue.
> 
> 
> I managed to reproduce this bug in a lab environment. Here are some details
> about my setup and the steps to reproduce the bug:
> 
> NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> 
> CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
>      Also reproduced on:
>      - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
>      - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> 
> Kernel: 6.15.0-rc1 (built from mainline)
> 
>   # ethtool -i ixgbe-x520-1
>   driver: ixgbe
>   version: 6.15.0-rc1
>   firmware-version: 0x00012b2c, 1.3429.0
>   expansion-rom-version: 
>   bus-info: 0000:01:00.0
>   supports-statistics: yes
>   supports-test: yes
>   supports-eeprom-access: yes
>   supports-register-dump: yes
>   supports-priv-flags: yes
> 
> The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
> connected with each other using a DAC cable. Both ports are configured to be
> slaves of a bonding with mode balance-rr.
> Neither the direct connection of  both ports nor the round-robin bonding mode
> are a requirement to reproduce the issue. This setup just allows it to be easier
> reproduced in an isolated environment. The issue is also visible with a regular
> 802.3ad link aggregation with a switch on the other side.
> 
>   # modprobe bonding
>   # ip link set dev ixgbe-x520-1 down
>   # ip link set dev ixgbe-x520-2 down
>   # ip link add bond0 type bond mode balance-rr
>   # ip link set dev ixgbe-x520-1 master bond0
>   # ip link set dev ixgbe-x520-2 master bond0
>   # ip link set dev ixgbe-x520-1 up
>   # ip link set dev ixgbe-x520-2 up
>   # ip link set dev bond0 up
>         
>   # cat /proc/net/bonding/bond0
>   Ethernet Channel Bonding Driver: v6.15.0-rc1
> 
>   Bonding Mode: load balancing (round-robin)
>   MII Status: up
>   MII Polling Interval (ms): 0
>   Up Delay (ms): 0
>   Down Delay (ms): 0
>   Peer Notification Delay (ms): 0
> 
>   Slave Interface: ixgbe-x520-1
>   MII Status: up
>   Speed: 10000 Mbps
>   Duplex: full
>   Link Failure Count: 0
>   Permanent HW addr: 6c:b3:11:08:5c:3c
>   Slave queue ID: 0
> 
>   Slave Interface: ixgbe-x520-2
>   MII Status: up
>   Speed: 10000 Mbps
>   Duplex: full
>   Link Failure Count: 0
>   Permanent HW addr: 6c:b3:11:08:5c:3e
>   Slave queue ID: 0
> 
>   # ethtool -l ixgbe-x520-1
>   Channel parameters for ixgbe-x520-1:
>   Pre-set maximums:
>   RX:             n/a
>   TX:             n/a
>   Other:          1
>   Combined:       63
>   Current hardware settings:
>   RX:             n/a
>   TX:             n/a
>   Other:          1
>   Combined:       63
>   (same for ixgbe-x520-2)
> 
> In the following the xdp-tools from https://github.com/xdp-project/xdp-tools/
> are used.
> 
> Enable XDP on the bonding and make sure all received packets will be dropped:
>   # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0
> 
> Redirect a batch of packets to the bonding interface:
>   # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0>
>     --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
> 
> Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
> (see above) will show up in the kernel log.
> 
> The high number of packets and thread count (--threads 16) is not required to
> trigger the issue but greatly improves the probability.
> 
> 
> Do you have any ideas what may be causing this issue or what I can do to
> diagnose this further?
> 
> Please let me know when I should provide any more information.
> 
> 
> Thanks!
> Marcus
> 

Hi Marcus,

Thank you for reporting this issue!
I have just successfully reproduced the problem on our lab machine. What
is interesting is that I do not seem to have to use a bonding interface
to get the "Tx timeout" that causes the adapter to reset.

I will try to debug the problem more closely and let you know of any
updates.

Thanks,
Michal

next prev parent reply	other threads:[~2025-04-10 14:31 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
2025-04-09 15:17 ` [Intel-wired-lan] " Marcus Wichelmann
2025-04-10 14:30 ` Michal Kubiak [this message]
2025-04-10 14:30   ` Michal Kubiak
2025-04-10 14:54   ` Marcus Wichelmann
2025-04-10 14:54     ` [Intel-wired-lan] " Marcus Wichelmann
2025-04-11  8:14     ` Michal Kubiak
2025-04-11  8:14       ` [Intel-wired-lan] " Michal Kubiak
2025-04-17 14:47       ` Maciej Fijalkowski
2025-04-17 14:47         ` [Intel-wired-lan] " Maciej Fijalkowski
2025-04-23 14:20         ` Marcus Wichelmann
2025-04-23 14:20           ` [Intel-wired-lan] " Marcus Wichelmann
2025-04-23 18:39           ` Maciej Fijalkowski
2025-04-23 18:39             ` [Intel-wired-lan] " Maciej Fijalkowski
2025-04-24 10:19             ` Tobias Böhm
2025-04-24 10:19               ` [Intel-wired-lan] " Tobias Böhm
2025-05-05 15:23               ` Tobias Böhm
2025-05-05 15:23                 ` [Intel-wired-lan] " Tobias Böhm
2025-05-08 19:25                 ` Maciej Fijalkowski
2025-05-08 19:25                   ` [Intel-wired-lan] " Maciej Fijalkowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z/fWHYETBYQuCno5@localhost.localdomain \
    --to=michal.kubiak@intel.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=anthony.l.nguyen@intel.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=john.fastabend@gmail.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marcus.wichelmann@hetzner-cloud.de \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=sdn@hetzner-cloud.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.