From: Chuck Lever III <chuck.lever@oracle.com>
To: "hedrick@rutgers.edu" <hedrick@rutgers.edu>,
Timothy Pearson <tpearson@raptorengineering.com>
Cc: Bruce Fields <bfields@fieldses.org>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
Date: Mon, 9 Aug 2021 17:37:19 +0000
Message-ID: <77ED566A-7738-4F62-867C-1C2DFC5D34AB@oracle.com>
In-Reply-To: <2FEAFB26-C723-450D-A115-1D82841BBF73@rutgers.edu>
> On Aug 9, 2021, at 1:15 PM, hedrick@rutgers.edu wrote:
>
> There seems to be a soft lockup message on the console, but that’s all I can find.
Then when you say "server hangs" you mean that the entire NFS server
system deadlocks. It's not just unresponsive on one or more exports.
A soft lockup is typically caused by a segmentation fault in code
that is not running in process context.
> I’m currently considering whether it’s best to move to NFS 4.0, which seems not to cause the issue, or 4.2 with delegations disabled. This is the primary server for the department. If it fails, everything fails: VMs become read-only, user jobs fail, etc.
>
> We ran for a year before this showed up, so I’m pretty sure going to 4.0 will fix it. But I have use cases for ACLs that will only work with 4.2. Since the problem seems to be in the callback mechanism, and as far as I can tell that’s only used for delegations, I assume turning off delegations will fix it.
In NFSv4.1 and later, the callback channel is also used for pNFS. It
can also be used for lock notification in all minor versions.
Disabling delegation can have a performance impact, but it depends on
the nature of your workloads and whether files are shared amongst
your client population.
> We’ve also had a history of issues with 4.2 problems on clients. That’s why we backed off to 4.0 initially. Clients were seeing hangs.
Let's stick with the server issue for the moment.
Enabling some tracepoints might give us more insight, though if the
server then crashes we would be hard pressed to examine the trace
records. If it's pretty common to get multiple receive_cb_reply
error messages in a short space of time, you might enable a triggered
tracepoint in that function to start a 60-second tcpdump capture to
a file.
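
As a rough illustration of the idea, here is a user-space sketch in Python.
It is not the kernel's tracepoint trigger mechanism itself; it swaps in a
simpler technique: follow the kernel log, and when several lines mentioning
receive_cb_reply arrive within a short window, start a 60-second tcpdump
capture to a file. The threshold, the window, the output path, the capture
interface, and the port filter (2049) are all assumptions chosen for the
example, and the names BurstDetector, tcpdump_command, and watch are
hypothetical helpers, not part of any existing tool.

```python
import collections
import re
import subprocess
import time

# Substring of the error message named above; the full text is not assumed.
PATTERN = re.compile(r"receive_cb_reply")


class BurstDetector:
    """Report True once `threshold` events fall within `window` seconds."""

    def __init__(self, threshold=3, window=10.0):
        # Both defaults are illustrative assumptions, not recommendations.
        self.threshold = threshold
        self.window = window
        self.stamps = collections.deque()

    def observe(self, now):
        self.stamps.append(now)
        # Drop events that are older than the window.
        while self.stamps and now - self.stamps[0] > self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.threshold:
            # Reset so one burst starts only one capture.
            self.stamps.clear()
            return True
        return False


def tcpdump_command(outfile, iface="any", port=2049):
    """Build a command that captures for 60 seconds and then exits.

    Assumes coreutils `timeout` and `tcpdump` are installed; the interface
    and port filter are guesses and may need adjusting for a given setup.
    """
    return ["timeout", "60", "tcpdump", "-i", iface,
            "-w", outfile, "port", str(port)]


def watch(lines, detector, clock=time.monotonic, launch=subprocess.Popen):
    """Scan log lines; start a capture on each detected burst.

    Returns the number of captures started.
    """
    started = 0
    for line in lines:
        if PATTERN.search(line) and detector.observe(clock()):
            # Hypothetical output location; pick a filesystem with space.
            outfile = "/var/tmp/nfs-cb-%d.pcap" % int(time.time())
            launch(tcpdump_command(outfile))
            started += 1
    return started
```

To run it against the live kernel log (as root), one could feed it the
output of `dmesg --follow`, for example with
`proc = subprocess.Popen(["dmesg", "--follow"], stdout=subprocess.PIPE, text=True)`
followed by `watch(proc.stdout, BurstDetector())`. Because the capture is
written to a file as it runs, the data survives on disk up to the point of
a hang, provided the filesystem it lands on is still writable.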
--
Chuck Lever