linux-block.vger.kernel.org archive mirror
From: Christian Herzog <herzog@phys.ethz.ch>
To: Yu Kuai <yukuai1@huaweicloud.com>
Cc: linux-block@vger.kernel.org, "yukuai (C)" <yukuai3@huawei.com>
Subject: Re: file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm
Date: Thu, 20 Apr 2023 14:57:37 +0200
Message-ID: <ZEE2wZPpY7JBWbY8@phys.ethz.ch>
In-Reply-To: <50766556-cd33-7506-13b1-64940b5995bb@huaweicloud.com>

Dear all,

we just had another freeze on one of our bookworm file servers. The scenario
is a bit different, but the root cause might be just the same. So what
happened:

- the server had been happily serving NFS + SMB for two weeks
- today I noticed a left-over rsync process from a recent backup run that
  didn't do any IO and was in D state
- I killed this rsync process, but since it was in D, it never died
- after a few minutes I noticed an nfsd in D state too (but just one). I
  watched it for a bit and then decided to try "service nfs-kernel-server
  restart" to see whether NFS was involved again. I guess it was...
- from then on, all sorts of processes entered eternal D: several smbd,
  autofs, the rsync and one nfsd
- however: at all times, the underlying file systems seemed perfectly fine. We
  could write to every single one of them and gdu the hundred-TiB ones without
  a problem
- my impression is that at least this time, nfsd was just one of the victims
  of a deeper problem
- we took all the forensics suggested last time by Kuai and Bob. I don't
  really understand them, but here are the facts:
  - memory on the machine is completely uncritical, < 20% used
  - the rqos/wbt/inflight of all block devices are 0 (remember: those are
    iSCSI LUNs)
  - all the hctx* values seem unsuspicious to me, but what do I know
  - the stack traces of the D processes don't show any rq_qos_wait this time
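
For reference, the rqos/wbt/inflight check mentioned above could be scripted
roughly as follows. This is only a sketch under assumptions: it assumes
debugfs is mounted at /sys/kernel/debug, that the script runs as root, and
that each inflight file contains lines of the form "N: inflight M". The
helper names parse_inflight and busy_devices are made up for illustration.

```python
import re
from pathlib import Path

# Assumed line format of rqos/wbt/inflight: "<index>: inflight <count>"
INFLIGHT_RE = re.compile(r"^\s*(\d+):\s*inflight\s+(-?\d+)\s*$")


def parse_inflight(text):
    """Turn the contents of one inflight file into {index: count}."""
    counts = {}
    for line in text.splitlines():
        m = INFLIGHT_RE.match(line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts


def busy_devices(debug_root="/sys/kernel/debug/block"):
    """Return {device: counts} for devices with a non-zero inflight count."""
    busy = {}
    for path in sorted(Path(debug_root).glob("*/rqos/wbt/inflight")):
        try:
            counts = parse_inflight(path.read_text())
        except OSError:
            continue  # file vanished or is not readable; skip it
        if any(counts.values()):
            # path is <root>/<device>/rqos/wbt/inflight
            busy[path.parts[-4]] = counts
    return busy
```

Calling busy_devices() and getting an empty dict back would correspond to
"all inflight counters are 0", which is what we observed here.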

here's the D rsync trace:

[<0>] iterate_dir+0x52/0x1c0
[<0>] __x64_sys_getdents64+0x84/0x120
[<0>] do_syscall_64+0x58/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd


and the D nfsd:

[<0>] vfs_rename+0x266/0xd70
[<0>] nfsd_rename+0x327/0x470 [nfsd]
[<0>] nfsd4_rename+0x53/0x110 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30
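
Traces like the two above can be collected for every task in D state with a
small script. The following is a minimal sketch, not the exact procedure we
used: it reads the state field from /proc/<pid>/stat and the kernel stack
from /proc/<pid>/stack (the latter normally requires root). The function
names parse_stat and d_state_processes are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate it via the first '(' and the last ')'.
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return comm, state


def d_state_processes(proc="/proc"):
    """Return a list of (pid, comm, stack) for tasks in uninterruptible sleep."""
    result = []
    for entry in sorted(os.listdir(proc)):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "stack")) as f:
                stack = f.read()
        except OSError:
            stack = "(stack unavailable, root is usually required)\n"
        result.append((int(entry), comm, stack))
    return result
```

Checking whether "rq_qos_wait" occurs in any of the returned stack strings
would then reproduce the comparison with the earlier freeze.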

all the forensics are contained in
https://people.phys.ethz.ch/~daduke/freeze.tgz

we would be extremely grateful for any hints on how we can debug (or even solve)
this. We're really at a loss here...


thanks and kind regards,
-Christian


-- 
Dr. Christian Herzog <herzog@phys.ethz.ch>  support: +41 44 633 26 68
Head, IT Services Group, HPT H 8              voice: +41 44 633 39 50
Department of Physics, ETH Zurich           
8093 Zurich, Switzerland                     http://isg.phys.ethz.ch/

Thread overview: 3+ messages
2023-04-06 16:59 file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm Christian Herzog
2023-04-07  6:26 ` Yu Kuai
2023-04-20 12:57   ` Christian Herzog [this message]