From: Christian Herzog <herzog@phys.ethz.ch>
To: Yu Kuai <yukuai1@huaweicloud.com>
Cc: linux-block@vger.kernel.org, "yukuai (C)" <yukuai3@huawei.com>
Subject: Re: file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm
Date: Thu, 20 Apr 2023 14:57:37 +0200
Message-ID: <ZEE2wZPpY7JBWbY8@phys.ethz.ch>
In-Reply-To: <50766556-cd33-7506-13b1-64940b5995bb@huaweicloud.com>

Dear all,

we just had another freeze on one of our bookworm file servers. The scenario
is a bit different, but the root cause might be the same. Here's what
happened:

- the server had been happily serving NFS + SMB for two weeks
- today I noticed a left-over rsync process from a recent backup run that
  didn't do any IO and was in D state
- I killed this rsync process, but since it was in D, it never died
- after a few minutes I noticed an nfsd in D state too (but just one). I
  watched it for a bit and then decided to try "service nfs-kernel-server
  restart" to see whether NFS was involved again. I guess it was...
- from then on, all sorts of processes got stuck in D state for good: several
  smbd processes, autofs, the rsync and one nfsd
- however: at all times, the underlying file systems seemed perfectly fine. We
  could write to every single one of them and gdu the hundred-TiB ones without
  a problem
- my impression is that at least this time, nfsd was just one of the victims
  of a deeper problem
- we collected all the forensics Kuai and Bob suggested last time. I don't
  really understand them, but here are the facts:
  - memory on the machine is completely uncritical, < 20% used
  - the rqos/wbt/inflight of all block devices are 0 (remember: those are
    iSCSI LUNs)
  - all the hctx* values seem unsuspicious to me, but what do I know
  - the stack traces of the D processes don't show any rq_qos_wait this time
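For reference, the wbt check above could be scripted roughly like this. It is only a sketch, assuming debugfs is mounted at /sys/kernel/debug, that each device's writeback-throttling counters live in block/<dev>/rqos/wbt/inflight, and that each line reads like "0: inflight 0"; the helper names are made up for illustration.

```python
import glob
import re

# Assumption: debugfs is mounted at /sys/kernel/debug and wbt is active,
# so each device exposes block/<dev>/rqos/wbt/inflight.
WBT_GLOB = "/sys/kernel/debug/block/*/rqos/wbt/inflight"


def parse_inflight(text):
    """Return the inflight counts from lines like '0: inflight 3'."""
    counts = []
    for line in text.splitlines():
        m = re.match(r"\s*\d+:\s*inflight\s+(-?\d+)", line)
        if m:
            counts.append(int(m.group(1)))
    return counts


def nonzero_wbt_inflight(pattern=WBT_GLOB):
    """Map device name -> inflight counts, only for devices with a nonzero count."""
    result = {}
    for path in glob.glob(pattern):
        # "/sys/kernel/debug/block/<dev>/rqos/wbt/inflight" -> "<dev>"
        dev = path.split("/")[5]
        with open(path) as f:
            counts = parse_inflight(f.read())
        if any(counts):
            result[dev] = counts
    return result


if __name__ == "__main__":
    # Reading debugfs normally requires root.
    print(nonzero_wbt_inflight() or "all wbt inflight counters are 0")
```

An empty result matches what we saw here: every counter at 0.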

here's the D rsync trace:

[<0>] iterate_dir+0x52/0x1c0
[<0>] __x64_sys_getdents64+0x84/0x120
[<0>] do_syscall_64+0x58/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd


and the D nfsd:

[<0>] vfs_rename+0x266/0xd70
[<0>] nfsd_rename+0x327/0x470 [nfsd]
[<0>] nfsd4_rename+0x53/0x110 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30
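Traces in this format are what /proc/<pid>/stack prints. A rough sketch of dumping them for every task in uninterruptible sleep (D state), assuming root privileges and the usual /proc/<pid>/stat layout where the state letter follows the parenthesised command name; the helper names are hypothetical.

```python
import os


def task_state(stat_text):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split after the *last* ')'.
    """
    return stat_text[stat_text.rfind(")") + 1:].split()[0]


def d_state_stacks(proc="/proc"):
    """Yield (pid, comm, kernel stack) for each task in D state.

    Only the top-level /proc/<pid> entries are scanned; kernel threads
    such as nfsd each have their own pid, so they show up here.
    """
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        base = os.path.join(proc, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            if task_state(stat) != "D":
                continue
            comm = stat[stat.find("(") + 1:stat.rfind(")")]
            with open(os.path.join(base, "stack")) as f:  # needs root
                stack = f.read()
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # task exited meanwhile, or stack not readable
        yield int(entry), comm, stack


if __name__ == "__main__":
    for pid, comm, stack in d_state_stacks():
        print(f"== {pid} {comm}\n{stack}")
```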

all the forensics are contained in
https://people.phys.ethz.ch/~daduke/freeze.tgz

we would be extremely grateful for any hints on how we can debug (or even
solve) this. We're really at a loss here...


thanks and kind regards,
-Christian


-- 
Dr. Christian Herzog <herzog@phys.ethz.ch>  support: +41 44 633 26 68
Head, IT Services Group, HPT H 8              voice: +41 44 633 39 50
Department of Physics, ETH Zurich           
8093 Zurich, Switzerland                     http://isg.phys.ethz.ch/
