Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: "Tobias Böhm" <tobias.boehm@hetzner-cloud.de>
Cc: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>,
	Michal Kubiak <michal.kubiak@intel.com>,
	Tony Nguyen <anthony.l.nguyen@intel.com>,
	"Jay Vosburgh" <jv@jvosburgh.net>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	<intel-wired-lan@lists.osuosl.org>, <netdev@vger.kernel.org>,
	<bpf@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<sdn@hetzner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Date: Thu, 8 May 2025 21:25:12 +0200
Message-ID: <aB0FGKnwPq4rqkVq@boxer>
In-Reply-To: <1713bf39-2bcb-4a43-94c7-a61ff97e2522@hetzner-cloud.de>

On Mon, May 05, 2025 at 05:23:02PM +0200, Tobias Böhm wrote:
> On 24.04.25 at 12:19, Tobias Böhm wrote:
> > On 23.04.25 at 20:39, Maciej Fijalkowski wrote:
> > > On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> > > > On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
> > > > > On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> > > > > > On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> > > > > > > On 10.04.25 at 16:30, Michal Kubiak wrote:
> > > > > > > > On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > in a setup where I use native XDP to redirect packets to a bonding
> > > > > > > > > interface that's backed by two ixgbe slaves, I noticed that the ixgbe
> > > > > > > > > driver constantly resets the NIC with the following kernel output:
> > > > > > > > > 
> > > > > > > > >    ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> > > > > > > > >      Tx Queue             <4>
> > > > > > > > >      TDH, TDT             <17e>, <17e>
> > > > > > > > >      next_to_use          <181>
> > > > > > > > >      next_to_clean        <17e>
> > > > > > > > >    tx_buffer_info[next_to_clean]
> > > > > > > > >      time_stamp           <0>
> > > > > > > > >      jiffies              <10025c380>
> > > > > > > > >    ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> > > > > > > > >    ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> > > > > > > > >    ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> > > > > > > > > 
> > > > > > > > > This only occurs in combination with a bonding interface and XDP, so I
> > > > > > > > > don't know if this is an issue with ixgbe or the bonding driver.
> > > > > > > > > I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and
> > > > > > > > > 6.15.0-rc1 show the same issue.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I managed to reproduce this bug in a lab environment. Here are some
> > > > > > > > > details about my setup and the steps to reproduce the bug:
> > > > > > > > > 
> > > > > > > > > [...]
> > > > > > > > > 
> > > > > > > > > Do you have any ideas what may be causing this issue or what I can do
> > > > > > > > > to diagnose this further?
> > > > > > > > > 
> > > > > > > > > Please let me know if I should provide any more information.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > Marcus
> > > > > > > > > 
> > > > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > Hi Marcus,
> > > > > > 
> > > > > > > thank you for looking into it. And not even 24 hours after my report,
> > > > > > > I'm very impressed! ;)
> > > > > > 
> > > > > > Thanks! :-)
> > > > > > 
> > > > > > > Interesting. I just tried again but had no luck yet with reproducing it
> > > > > > > without a bonding interface. May I ask what your setup looks like?
> > > > > > 
> > > > > > For now, I've just grabbed the first available system with the HW
> > > > > > controlled by the "ixgbe" driver. In my case it was:
> > > > > > 
> > > > > >    Ethernet controller: Intel Corporation Ethernet Controller X550
> > > > > > 
> > > > > > Also, for my first attempt, I didn't use the upstream kernel - I just
> > > > > > tried the kernel installed on that system. It was the Fedora kernel:
> > > > > > 
> > > > > >    6.12.8-200.fc41.x86_64
> > > > > > 
> > > > > > 
> > > > > > I think that may be the "beauty" of timing issues - sometimes you can
> > > > > > change just one piece in your system and get a completely different
> > > > > > replication ratio.
> > > > > > Anyway, the higher the repro probability, the easier it is to debug
> > > > > > the timing problem. :-)
> > > > > 
> > > > > Hi Marcus, to break the silence could you try to apply the diff below
> > > > > on your side?
> > > > 
> > > > Hi, thank you for the patch. We've tried it and with your changes we can
> > > > no longer trigger the error and the NIC is no longer being reset.
> > > > 
> > > > > We see several issues around XDP queues in ixgbe, but before we
> > > > > proceed let's try this small change on your side.
> > > > 
> > > > How confident are you that this patch is sufficient to make things
> > > > stable enough for production use? Was it just the Tx hang detection that
> > > > was misbehaving for the XDP case, or is there an underlying issue with
> > > > the XDP queues that is not solved by disabling the detection for it?
> > > 
> > > I believe the correct way to approach this is to move the Tx hang
> > > detection into ixgbe_tx_timeout(), as that is where this logic belongs.
> > > By doing so I suppose we would kill two birds with one stone, since that
> > > ndo is called from the netdev watchdog, which does not cover XDP Tx
> > > queues.
> > > 
> > > > 
> > > > With our current setup we cannot accurately verify that we have no
> > > > packet loss or stuck queues. We can do additional tests to verify that.
> > 
> > 
> > Hi Maciej,
> > 
> > I'm a colleague of Marcus and involved in the testing as well.
> > 
> > > > > Additional question, do you have enabled pause frames on your setup?
> > > > 
> > > > Pause frames were enabled, but we can also reproduce it after disabling
> > > > them, without your patch.
> > > 
> > > Please give your setup a go with pause frames enabled and with the patch
> > > that I shared previously applied, and let us see the results. As said
> > > above, I do not think it is correct to check for hung queues in the Tx
> > > descriptor cleaning routine. This is the job of the ndo_tx_timeout
> > > callback.
> > > 
> > 
> > We have tested with pause frames enabled and the patch applied, and can
> > no longer trigger the error in our lab setup.
> > 
> > > > 
> > > > Thanks!
> > > 
> > > Thanks for the feedback and testing. I'll provide a proper fix tomorrow
> > > and CC you so you can take it for a spin.
> > > 
> > 
> > That sounds great. We'd be happy to test with the proper fix in our
> > original setup.
> 
> Hi,
> 
> During further testing with this patch applied we noticed new warnings that
> show up. We've also tested with the new patch sent ("[PATCH iwl-net] ixgbe:
> fix ndo_xdp_xmit() workloads") and see the same warnings.
> 
> I'm sending this observation to this thread because I'm not sure if it is
> related to those patches or if it was already present but hidden by the
> resets of the original issue reported by Marcus.
> 
> After processing test traffic (~10 million packets as described in Marcus'
> reproducer setup) and idling for a minute the following warnings keep being
> logged as long as the NIC idles:
> 
>   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
>   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
>   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 120 sec
>   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 120 sec
>   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 181 sec
>   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 181 sec
>   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
>   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec
> 
> Just sending a single packet makes the warnings stop being logged.
> 
> After sending heavy test traffic again new warnings start to be logged after
> a minute of idling:
> 
>   page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
>   page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
>   page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
>   page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec
> 
> Detaching the XDP program stops the warnings as well.
> 
> As before pause frames were enabled.
> 
> Just like with the original issue, we did not always succeed in reproducing
> those warnings. With more traffic the chances of triggering it seem to be
> higher.
> 
> Please let me know if I should provide any further information.

I can't reproduce this on my system, but FWIW these warnings come from the
page pool created by xdp-trafficgen. My bet is that the ixgbe Tx cleaning
routine misses two entries for some reason.
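
For readers unfamiliar with the warning: the "2 inflight" figure is the
difference between the number of pages the pool has handed out and the
number it has seen returned. Below is a rough user-space model of that
accounting, written in Python rather than kernel C. The function name is
made up for illustration, and the 32-bit counter width is an assumption,
not something stated in this thread.

```python
def inflight(hold_cnt: int, release_cnt: int) -> int:
    """Pages handed out minus pages returned, as a signed 32-bit distance.

    Assumption: both counters are free-running 32-bit values, so the
    subtraction is done modulo 2**32 and then reinterpreted as signed.
    """
    diff = (hold_cnt - release_cnt) & 0xFFFFFFFF
    return diff - (1 << 32) if diff & 0x80000000 else diff


# Hypothetical counter values: two pages were handed out but never returned.
print(inflight(10_000_000, 9_999_998))

# The distance is still computed correctly after the hold counter wraps.
print(inflight(1, 0xFFFFFFFF))
```

In this model, a pool being shut down can only finish once the distance
reaches zero, which is consistent with the warnings repeating while two
pages remain outstanding.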

What are your ring sizes? If you want, I can provide a patch that optimizes
the Tx cleaning path so we can see whether it silences the warnings on your
side.
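
As background on why the ring size is relevant: the driver's ring indices
wrap at the ring size, so the number of descriptors still waiting to be
cleaned is a modular difference. The hang report quoted at the top of this
message shows next_to_use <181> and next_to_clean <17e>. A small Python
sketch of that arithmetic follows; the ring size of 512 is an assumption
(the thread does not state it) and the helper name is hypothetical.

```python
def pending_descriptors(next_to_use: int, next_to_clean: int, ring_size: int) -> int:
    """Descriptors queued by the driver but not yet cleaned, modulo ring size."""
    return (next_to_use - next_to_clean) % ring_size


RING_SIZE = 512  # assumption: not given anywhere in the thread

# Values from the quoted hang report: three descriptors are outstanding.
print(pending_descriptors(0x181, 0x17E, RING_SIZE))

# Hypothetical values showing the wrap-around case near the end of the ring.
print(pending_descriptors(0x002, 0x1FE, RING_SIZE))
```

For the values in the report the result does not depend on the assumed
ring size, as long as the ring is larger than the indices shown.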

> 
> Thanks,
> Tobias

Thread overview:
2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
2025-04-10 14:30 ` Michal Kubiak
2025-04-10 14:54   ` Marcus Wichelmann
2025-04-11  8:14     ` Michal Kubiak
2025-04-17 14:47       ` Maciej Fijalkowski
2025-04-23 14:20         ` Marcus Wichelmann
2025-04-23 18:39           ` Maciej Fijalkowski
2025-04-24 10:19             ` Tobias Böhm
2025-05-05 15:23               ` Tobias Böhm
2025-05-08 19:25                 ` Maciej Fijalkowski [this message]
