public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Cc: Michal Kubiak <michal.kubiak@intel.com>,
	Tony Nguyen <anthony.l.nguyen@intel.com>,
	Jay Vosburgh <jv@jvosburgh.net>,
	"Przemek Kitszel" <przemyslaw.kitszel@intel.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	<intel-wired-lan@lists.osuosl.org>, <netdev@vger.kernel.org>,
	<bpf@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<sdn@hetzner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Date: Wed, 23 Apr 2025 20:39:59 +0200	[thread overview]
Message-ID: <aAkz/+Rx5w3OHH4/@boxer> (raw)
In-Reply-To: <b06ede77-541b-453f-9e7a-79f3e5591f66@hetzner-cloud.de>

On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> Am 17.04.25 um 16:47 schrieb Maciej Fijalkowski:
> > On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> >> On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> >>> Am 10.04.25 um 16:30 schrieb Michal Kubiak:
> >>>> On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> >>>>> Hi,
> >>>>>
> >>>>> in a setup where I use native XDP to redirect packets to a bonding interface
> >>>>> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> >>>>> resets the NIC with the following kernel output:
> >>>>>
> >>>>>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> >>>>>     Tx Queue             <4>
> >>>>>     TDH, TDT             <17e>, <17e>
> >>>>>     next_to_use          <181>
> >>>>>     next_to_clean        <17e>
> >>>>>   tx_buffer_info[next_to_clean]
> >>>>>     time_stamp           <0>
> >>>>>     jiffies              <10025c380>
> >>>>>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> >>>>>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> >>>>>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> >>>>>
> >>>>> This only occurs in combination with a bonding interface and XDP, so I don't
> >>>>> know if this is an issue with ixgbe or the bonding driver.
> >>>>> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> >>>>> show the same issue.
> >>>>>
> >>>>>
> >>>>> I managed to reproduce this bug in a lab environment. Here are some details
> >>>>> about my setup and the steps to reproduce the bug:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>> Do you have any ideas what may be causing this issue or what I can do to
> >>>>> diagnose this further?
> >>>>>
> >>>>> Please let me know when I should provide any more information.
> >>>>>
> >>>>>
> >>>>> Thanks!
> >>>>> Marcus
> >>>>>
> >>>>
> >> [...]
> >>
> >> Hi Marcus,
> >>
> >>> thank you for looking into it. And not even 24 hours after my report, I'm
> >>> very impressed! ;)
> >>
> >> Thanks! :-)
> >>
> >>> Interesting. I just tried again but had no luck yet with reproducing it
> >>> without a bonding interface. May I ask how your setup looks like?
> >>
> >> For now, I've just grabbed the first available system with the HW
> >> controlled by the "ixgbe" driver. In my case it was:
> >>
> >>   Ethernet controller: Intel Corporation Ethernet Controller X550
> >>
> >> Also, for my first attempt, I didn't use the upstream kernel - I just tried
> >> the kernel installed on that system. It was the Fedora kernel:
> >>
> >>   6.12.8-200.fc41.x86_64
> >>
> >>
> >> I think that may be the "beauty" of timing issues - sometimes you can change
> >> just one piece in your system and get a completely different replication ratio.
> >> Anyway, the higher the repro probability, the easier it is to debug
> >> the timing problem. :-)
> > 
> > Hi Marcus, to break the silence could you try to apply the diff below on
> > your side?
> 
> Hi, thank you for the patch. We've tried it and with your changes we can no
> longer trigger the error and the NIC is no longer being reset.
> 
> > We see several issues around XDP queues in ixgbe, but before we
> > proceed let's this small change on your side.
> 
> How confident are you that this patch is sufficient to make things stable enough
> for production use? Was it just the Tx hang detection that was misbehaving for
> the XDP case, or is there an underlying issue with the XDP queues that is not
> solved by disabling the detection for it?

I believe that correct way to approach this is to move the Tx hang
detection onto ixgbe_tx_timeout() as that is the place where this logic
belongs to. By doing so I suppose we would kill two birds with one stone
as mentioned ndo is called under netdev watchdog which is not a subject
for XDP Tx queues.

> 
> With our current setup we cannot verify accurately, that we have no packet loss 
> or stuck queues. We can do additional tests to verify that. 
>  
> > Additional question, do you have enabled pause frames on your setup?
> 
> Pause frames were enabled, but we can also reproduce it after disabling them,
> without your patch.

Please give your setup a go with pause frames enabled and applied patch
that i shared previously and let us see the results. As said above I do
not think it is correct to check for hung queues in Tx descriptor cleaning
routine. This is a job of ndo_tx_timeout callback.

> 
> Thanks!

Thanks for feedback and testing. I'll provide a proper fix tomorrow and CC
you so you could take it for a spin.

> Marcus

  reply	other threads:[~2025-04-23 18:40 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-09 15:17 [BUG] ixgbe: Detected Tx Unit Hang (XDP) Marcus Wichelmann
2025-04-10 14:30 ` Michal Kubiak
2025-04-10 14:54   ` Marcus Wichelmann
2025-04-11  8:14     ` Michal Kubiak
2025-04-17 14:47       ` Maciej Fijalkowski
2025-04-23 14:20         ` Marcus Wichelmann
2025-04-23 18:39           ` Maciej Fijalkowski [this message]
2025-04-24 10:19             ` Tobias Böhm
2025-05-05 15:23               ` Tobias Böhm
2025-05-08 19:25                 ` Maciej Fijalkowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aAkz/+Rx5w3OHH4/@boxer \
    --to=maciej.fijalkowski@intel.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=anthony.l.nguyen@intel.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=john.fastabend@gmail.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marcus.wichelmann@hetzner-cloud.de \
    --cc=michal.kubiak@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=sdn@hetzner-cloud.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox