From: Jeff Layton <jlayton@kernel.org>
To: "Switzer, David" <david.switzer@intel.com>,
"intel-wired-lan@lists.osuosl.org"
<intel-wired-lan@lists.osuosl.org>,
"Nguyen, Anthony L" <anthony.l.nguyen@intel.com>,
"Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>, Xiubo Li <xiubli@redhat.com>,
Venky Shankar <vshankar@redhat.com>
Subject: Re: [Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18 kernels
Date: Wed, 08 Jun 2022 08:44:49 -0400
Message-ID: <5522f28fa2aef2890c1d5533899b0a7954bddc6a.camel@kernel.org>
In-Reply-To: <SN6PR11MB351830AC7CCF4B4D2C49F165EBA59@SN6PR11MB3518.namprd11.prod.outlook.com>
On Tue, 2022-06-07 at 21:22 +0000, Switzer, David wrote:
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org>
> > On Behalf Of Jeff Layton
> > Sent: Thursday, June 2, 2022 2:38 PM
> > To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; Brandeburg, Jesse
> > <jesse.brandeburg@intel.com>
> > Cc: Ilya Dryomov <idryomov@gmail.com>; Xiubo Li <xiubli@redhat.com>;
> > Venky Shankar <vshankar@redhat.com>
> > Subject: [Intel-wired-lan] intermittent ixgbe transmit queue
> > timeouts in v5.18 kernels
> >
> > The Ceph project test lab has a fairly large cluster of machines
> > with ixgbe adapters:
> >
> > 03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> >
> We are attempting to reproduce your issue, and the output from
> lspci -s 03:00.0 -vv would help us make sure we're looking at the
> exact adapter that the issue is being seen on.
>
> > Recently, we've started getting intermittent tx queue timeouts with
> > these machines. One of them is reported here:
> >
> > https://tracker.ceph.com/issues/55823
> >
> > Usually this happens when we're trying to do a sync, and there is a
> > flurry of transmission activity. Afterward we see a lot of fallout
> > in ceph culminating in softlockups.
> >
> > The kernels we're testing have some patches that are not yet in
> > mainline, but mostly they are confined to net/ceph and fs/ceph, and
> > shouldn't really affect hw drivers.
> >
> > The problem manifested pretty regularly during v5.18 and then I
> > didn't see it for a while. I had figured it was something that had
> > been fixed, but I think it was just "luck".
> >
> > I attempted a bisect a while back, and ruled out recent ceph changes
> > as the issue. Unfortunately, I wasn't able to get to a conclusive
> > patch that broke it, but I think it likely crept in during the
> > initial merge window for v5.18 (pre-rc1).
> >
> > One other oddity: the test lab often installs bleeding-edge kernels
> > on old distros (RHEL8 and Ubuntu from similar era). Is it possible
> > that the firmware that ships with these older distros is not
> > suitable for the more recent driver in v5.18?
> >
> Thank you for this information, we'll look into it if we're having
> trouble reproducing the issue!
>
>
> > Any thoughts or suggestions on things we can do to fix this?
> >
> Nothing yet, but we'll be sure to let you know when we find it.
>
Thanks for getting back to us.

Since I emailed you, I've found a bug in ceph that could make the cephfs
client spin in an (essentially) infinite loop if there were delays
getting MDS replies in some situations. We've fixed that and I haven't
seen any tx queue timeouts since, though I've only had the fix in place
for a day or so.

For now, I think we can just consider this to be fallout from the ceph
bug. If the problems return though, I'll let you know!

Thanks again!
--
Jeff Layton <jlayton@kernel.org>
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan