From: Chuck Lever III <chuck.lever@oracle.com>
To: Christian Herzog <herzog@phys.ethz.ch>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm
Date: Thu, 6 Apr 2023 13:48:06 +0000
Message-ID: <6785EFE7-2CE1-45CD-8643-C40CCCDADEB8@oracle.com>
In-Reply-To: <ZC6oX7FxdJd86rF7@phys.ethz.ch>

> On Apr 6, 2023, at 7:09 AM, Christian Herzog <herzog@phys.ethz.ch> wrote:
>
> Dear all,
>
> for our researchers we are running file servers in the hundreds-of-TiB to
> low-PiB range that export via NFS and SMB. Storage is iSCSI-over-Infiniband
> LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
> we prepared an upgrade to Debian bookworm and tests went well. About a week
> after one of the upgrades, we ran into the first occurrence of our problem: all
> of a sudden, all nfsds enter the D state and are not recoverable. However, the
> underlying file systems seem fine and can be read and written to. The only way
> out appears to be to reboot the server. The only clues are the frozen nfsds
> and stack traces like
>
> [<0>] rq_qos_wait+0xbc/0x130
> [<0>] wbt_wait+0xa2/0x110

Hi Christian, you have a pretty deep storage stack!
rq_qos_wait is a few layers below NFSD. Jens Axboe
and linux-block are the folks who maintain that.

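To make "a few layers below" concrete, here is a minimal sketch that tags
each frame of a trace like the one quoted here with a rough subsystem name
and counts how many frames sit between the blocked function and the first
nfsd frame. It is not part of any existing tool: the prefix-to-layer mapping
and the names parse() and frames_below_nfsd() are assumptions made purely for
illustration, and the sample uses only a few frames copied from the trace.

```python
import re

# Matches lines such as "> [<0>] xfs_file_fsync+0x5a/0x2a0 [xfs]".
FRAME_RE = re.compile(
    r"\[<0>\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)

# Hypothetical prefix table; the kernel itself has no such mapping.
BLOCK_PREFIXES = ("rq_qos", "__rq_qos", "wbt_", "blk_", "submit_bio")


def classify(sym, mod):
    """Return a rough layer name for one stack frame."""
    if mod in ("nfsd", "sunrpc", "xfs"):
        return mod  # the module tag in brackets is authoritative
    if sym.startswith(BLOCK_PREFIXES):
        return "block"
    if sym.startswith("iomap_"):
        return "iomap"
    return "other"


def parse(trace):
    """Turn trace text into a list of (symbol, layer), innermost first."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((m["sym"], classify(m["sym"], m["mod"])))
    return frames


def frames_below_nfsd(frames):
    """Count frames between the blocked function and the first nfsd frame."""
    for depth, (_, layer) in enumerate(frames):
        if layer == "nfsd":
            return depth
    return None  # no nfsd frame in this trace


SAMPLE = """\
> [<0>] rq_qos_wait+0xbc/0x130
> [<0>] iomap_submit_ioend+0x4b/0x80
> [<0>] xfs_file_fsync+0x5a/0x2a0 [xfs]
> [<0>] nfsd_commit+0x93/0x190 [nfsd]
> [<0>] svc_process+0xad/0x100 [sunrpc]
"""

if __name__ == "__main__":
    for sym, layer in parse(SAMPLE):
        print(f"{layer:8} {sym}")
```

If the innermost frame is classified as "block", the thread is waiting in
the block layer rather than in NFSD code, which is why linux-block is the
better place to ask.
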
> [<0>] __rq_qos_throttle+0x20/0x40
> [<0>] blk_mq_submit_bio+0x2d3/0x580
> [<0>] submit_bio_noacct_nocheck+0xf7/0x2c0
> [<0>] iomap_submit_ioend+0x4b/0x80
> [<0>] iomap_do_writepage+0x4b4/0x820
> [<0>] write_cache_pages+0x180/0x4c0
> [<0>] iomap_writepages+0x1c/0x40
> [<0>] xfs_vm_writepages+0x79/0xb0 [xfs]
> [<0>] do_writepages+0xbd/0x1c0
> [<0>] filemap_fdatawrite_wbc+0x5f/0x80
> [<0>] __filemap_fdatawrite_range+0x58/0x80
> [<0>] file_write_and_wait_range+0x41/0x90
> [<0>] xfs_file_fsync+0x5a/0x2a0 [xfs]
> [<0>] nfsd_commit+0x93/0x190 [nfsd]
> [<0>] nfsd4_commit+0x5e/0x90 [nfsd]
> [<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
> [<0>] nfsd_dispatch+0x167/0x280 [nfsd]
> [<0>] svc_process_common+0x286/0x5e0 [sunrpc]
> [<0>] svc_process+0xad/0x100 [sunrpc]
> [<0>] nfsd+0xd5/0x190 [nfsd]
> [<0>] kthread+0xe6/0x110
> [<0>] ret_from_fork+0x1f/0x30
>
> (we've also seen nfsd3). It's very sporadic, we have no idea what's triggering
> it and it has now happened 4 times on one server and once on a second.
> Needless to say, these are production systems, so we have a window of a few
> minutes for debugging before people start yelling. We've thrown everything we
> could at our test setup but so far haven't been able to trigger it.
> Any pointers would be highly appreciated.
>
>
> thanks and best regards,
> -Christian
>
>
>
> cat /etc/os-release
> PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
>
> uname -vr
> 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19)
>
> apt list --installed '*nfs*'
> libnfsidmap1/testing,now 1:2.6.2-4 amd64 [installed,automatic]
> nfs-common/testing,now 1:2.6.2-4 amd64 [installed]
> nfs-kernel-server/testing,now 1:2.6.2-4 amd64 [installed]
>
> nfsconf -d
> [exportd]
> debug = all
> [exportfs]
> debug = all
> [general]
> pipefs-directory = /run/rpc_pipefs
> [lockd]
> port = 32769
> udp-port = 32769
> [mountd]
> debug = all
> manage-gids = True
> port = 892
> [nfsd]
> debug = all
> port = 2049
> threads = 48
> [nfsdcld]
> debug = all
> [nfsdcltrack]
> debug = all
> [sm-notify]
> debug = all
> outgoing-port = 846
> [statd]
> debug = all
> outgoing-port = 2020
> port = 662
>
>
>
> --
> Dr. Christian Herzog <herzog@phys.ethz.ch> support: +41 44 633 26 68
> Head, IT Services Group, HPT H 8 voice: +41 44 633 39 50
> Department of Physics, ETH Zurich
> 8093 Zurich, Switzerland http://isg.phys.ethz.ch/
--
Chuck Lever