From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: "Tobias Böhm" <tobias.boehm@hetzner-cloud.de>
Cc: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>,
Michal Kubiak <michal.kubiak@intel.com>,
Tony Nguyen <anthony.l.nguyen@intel.com>,
"Jay Vosburgh" <jv@jvosburgh.net>,
Przemek Kitszel <przemyslaw.kitszel@intel.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
Alexei Starovoitov <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
Jesper Dangaard Brouer <hawk@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
<intel-wired-lan@lists.osuosl.org>, <netdev@vger.kernel.org>,
<bpf@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<sdn@hetzner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Date: Thu, 8 May 2025 21:25:12 +0200 [thread overview]
Message-ID: <aB0FGKnwPq4rqkVq@boxer> (raw)
In-Reply-To: <1713bf39-2bcb-4a43-94c7-a61ff97e2522@hetzner-cloud.de>
On Mon, May 05, 2025 at 05:23:02PM +0200, Tobias Böhm wrote:
> On 24.04.25 12:19, Tobias Böhm wrote:
> > On 23.04.25 20:39, Maciej Fijalkowski wrote:
> > > On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> > > > On 17.04.25 16:47, Maciej Fijalkowski wrote:
> > > > > On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> > > > > > On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> > > > > > > On 10.04.25 16:30, Michal Kubiak wrote:
> > > > > > > > On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > in a setup where I use native XDP to
> > > > > > > > > redirect packets to a bonding interface
> > > > > > > > > that's backed by two ixgbe slaves, I noticed
> > > > > > > > > that the ixgbe driver constantly
> > > > > > > > > resets the NIC with the following kernel output:
> > > > > > > > >
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> > > > > > > > > Tx Queue <4>
> > > > > > > > > TDH, TDT <17e>, <17e>
> > > > > > > > > next_to_use <181>
> > > > > > > > > next_to_clean <17e>
> > > > > > > > > tx_buffer_info[next_to_clean]
> > > > > > > > > time_stamp <0>
> > > > > > > > > jiffies <10025c380>
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang
> > > > > > > > > 19 detected on queue 4, resetting adapter
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2:
> > > > > > > > > initiating reset due to tx timeout
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> > > > > > > > >
> > > > > > > > > This only occurs in combination with a
> > > > > > > > > bonding interface and XDP, so I don't
> > > > > > > > > know if this is an issue with ixgbe or the bonding driver.
> > > > > > > > > I first discovered this with Linux 6.8.0-57,
> > > > > > > > > but kernel 6.14.0 and 6.15.0-rc1
> > > > > > > > > show the same issue.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I managed to reproduce this bug in a lab
> > > > > > > > > environment. Here are some details
> > > > > > > > > about my setup and the steps to reproduce the bug:
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > Do you have any ideas what may be causing
> > > > > > > > > this issue or what I can do to
> > > > > > > > > diagnose this further?
> > > > > > > > >
> > > > > > > > > Please let me know when I should provide any more information.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Marcus
> > > > > > > > >
> > > > > > > >
> > > > > > [...]
> > > > > >
> > > > > > Hi Marcus,
> > > > > >
> > > > > > > thank you for looking into it. And not even 24 hours
> > > > > > > after my report, I'm
> > > > > > > very impressed! ;)
> > > > > >
> > > > > > Thanks! :-)
> > > > > >
> > > > > > > Interesting. I just tried again but had no luck yet with
> > > > > > > reproducing it without a bonding interface. May I ask what
> > > > > > > your setup looks like?
> > > > > >
> > > > > > For now, I've just grabbed the first available system with the HW
> > > > > > controlled by the "ixgbe" driver. In my case it was:
> > > > > >
> > > > > > Ethernet controller: Intel Corporation Ethernet Controller X550
> > > > > >
> > > > > > Also, for my first attempt, I didn't use the upstream
> > > > > > kernel - I just tried
> > > > > > the kernel installed on that system. It was the Fedora kernel:
> > > > > >
> > > > > > 6.12.8-200.fc41.x86_64
> > > > > >
> > > > > >
> > > > > > I think that may be the "beauty" of timing issues -
> > > > > > sometimes you can change
> > > > > > just one piece in your system and get a completely
> > > > > > different replication ratio.
> > > > > > Anyway, the higher the repro probability, the easier it is to debug
> > > > > > the timing problem. :-)
> > > > >
> > > > > Hi Marcus, to break the silence could you try to apply the
> > > > > diff below on
> > > > > your side?
> > > >
> > > > Hi, thank you for the patch. We've tried it and with your
> > > > changes we can no
> > > > longer trigger the error and the NIC is no longer being reset.
> > > >
> > > > > We see several issues around XDP queues in ixgbe, but before we
> > > > > proceed, let's try this small change on your side.
> > > >
> > > > How confident are you that this patch is sufficient to make
> > > > things stable enough
> > > > for production use? Was it just the Tx hang detection that was
> > > > misbehaving for
> > > > the XDP case, or is there an underlying issue with the XDP
> > > > queues that is not
> > > > solved by disabling the detection for it?
> > >
> > > I believe the correct way to approach this is to move the Tx hang
> > > detection into ixgbe_tx_timeout(), as that is where this logic
> > > belongs. Doing so would kill two birds with one stone, since the
> > > mentioned ndo is called by the netdev watchdog, which does not cover
> > > XDP Tx queues.
> > >
> > > >
> > > > With our current setup we cannot accurately verify that we have
> > > > no packet loss or stuck queues. We can do additional tests to
> > > > verify that.
> >
> >
> > Hi Maciej,
> >
> > I'm a colleague of Marcus and involved in the testing as well.
> > > > > Additional question: do you have pause frames enabled on your setup?
> > > >
> > > > Pause frames were enabled, but we can also reproduce it after
> > > > disabling them,
> > > > without your patch.
> > >
> > > Please give your setup a go with pause frames enabled and the patch
> > > I shared previously applied, and let us see the results. As said
> > > above, I do not think it is correct to check for hung queues in the
> > > Tx descriptor cleaning routine. That is the job of the
> > > ndo_tx_timeout callback.
> > >
> >
> > We have tested with pause frames enabled and the patch applied, and
> > can no longer trigger the error in our lab setup.
> >
> > > >
> > > > Thanks!
> > >
> > > Thanks for the feedback and testing. I'll provide a proper fix
> > > tomorrow and CC you so you can take it for a spin.
> > >
> >
> > That sounds great. We'd be happy to test with the proper fix in our
> > original setup.
>
> Hi,
>
> During further testing with this patch applied we noticed new warnings that
> show up. We've also tested with the new patch sent ("[PATCH iwl-net] ixgbe:
> fix ndo_xdp_xmit() workloads") and see the same warnings.
>
> I'm sending this observation to this thread because I'm not sure if it is
> related to those patches or if it was already present but hidden by the
> resets of the original issue reported by Marcus.
>
> After processing test traffic (~10 million packets, as described in
> Marcus' reproducer setup) and idling for a minute, the following
> warnings keep being logged for as long as the NIC idles:
>
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 181 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 181 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec
>
> Just sending a single packet makes the warnings stop being logged.
>
> After sending heavy test traffic again, new warnings start to be logged
> after a minute of idling:
>
> page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec
>
> Detaching the XDP program stops the warnings as well.
>
> As before, pause frames were enabled.
>
> Just like with the original issue, we were not always able to reproduce
> those warnings. With more traffic, the chances of triggering it seem to
> be higher.
>
> Please let me know if I should provide any further information.
I can't reproduce this on my system, but FWIW these warnings are coming
from the page pool created by xdp-trafficgen; my bet is that the ixgbe Tx
cleaning routine misses two entries for some reason.

What are your ring sizes? If you insist, I can provide a patch that
optimizes the Tx cleaning processing, and we can see whether it silences
the warnings on your side.
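For background (not kernel code, just an illustrative toy model with
made-up names): page_pool cannot finish shutting down while pages it
allocated are still held elsewhere, e.g. parked in a Tx ring that the
cleaning routine skipped, and the release work re-logs the warning
roughly every 60 seconds until the inflight count drops to zero. That
also explains why sending one more packet stops the messages: the next
Tx clean returns the stuck pages. A minimal Python sketch of that
accounting:

```python
# Toy model of page_pool inflight accounting (illustrative only; the
# class and method names here are invented, not kernel APIs).

class ToyPagePool:
    def __init__(self, pool_id):
        self.pool_id = pool_id
        self.pages_allocated = 0  # pages handed out to the driver
        self.pages_returned = 0   # pages given back (recycled or freed)

    def alloc(self, n=1):
        self.pages_allocated += n

    def put(self, n=1):
        self.pages_returned += n

    def inflight(self):
        # Pages still owned elsewhere, e.g. sitting in a Tx ring whose
        # cleaning routine missed them.
        return self.pages_allocated - self.pages_returned

    def release_retry(self):
        """Mimics page_pool_release_retry(): the pool cannot be torn
        down while pages are outstanding, so it logs and retries."""
        if self.inflight() > 0:
            return ("page_pool_release_retry() stalled pool shutdown: "
                    f"id {self.pool_id}, {self.inflight()} inflight")
        return None  # all pages returned; shutdown can complete

pool = ToyPagePool(pool_id=968)
pool.alloc(512)   # ring filled with pool pages
pool.put(510)     # Tx clean returned all but two entries
print(pool.release_retry())  # stalled-shutdown warning, 2 inflight
pool.put(2)       # one more packet nudges the cleaner along
print(pool.release_retry())  # None: pool can now be freed
```

So a constant "2 inflight" in your logs matches the theory of exactly
two descriptors being skipped by the cleaner rather than a slow leak.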
>
> Thanks,
> Tobias
prev parent reply other threads:[~2025-05-08 19:26 UTC|newest]
Thread overview: 10+ messages
2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
2025-04-10 14:30 ` Michal Kubiak
2025-04-10 14:54 ` Marcus Wichelmann
2025-04-11 8:14 ` Michal Kubiak
2025-04-17 14:47 ` Maciej Fijalkowski
2025-04-23 14:20 ` Marcus Wichelmann
2025-04-23 18:39 ` Maciej Fijalkowski
2025-04-24 10:19 ` Tobias Böhm
2025-05-05 15:23 ` Tobias Böhm
2025-05-08 19:25 ` Maciej Fijalkowski [this message]