public inbox for linux-nfs@vger.kernel.org
From: Li Lingfeng via Bugspray Bot <bugbot@kernel.org>
To: benoit.gschwind@minesparis.psl.eu, harald.dunkel@aixigo.com,
	 herzog@phys.ethz.ch, tom@talpey.com, chuck.lever@oracle.com,
	 jlayton@kernel.org, cel@kernel.org, trondmy@kernel.org,
	 baptiste.pellegrin@ac-grenoble.fr, carnil@debian.org,
	 linux-nfs@vger.kernel.org, anna@kernel.org
Subject: Re: NFSD threads hang when destroying a session or client ID
Date: Thu, 23 Jan 2025 02:10:21 +0000	[thread overview]
Message-ID: <20250123-b219710c17-c6cd701c9207@bugzilla.kernel.org> (raw)
In-Reply-To: <20250120-b219710c0-da932078cddb@bugzilla.kernel.org>

Li Lingfeng writes via Kernel.org Bugzilla:

(In reply to Chuck Lever from comment #0)
> On recent v6.1.y, intermittently, NFSD threads wait forever for NFSv4
> callback to shutdown. The wait is in __flush_workqueue(). A server system
> reboot is necessary to recover.
> 
> On new kernels, similar symptoms but the indefinite wait is in the "destroy
> client" path, waiting for NFSv4 callback shutdown. The wait is on the
> wait_var_event() in nfsd41_cb_inflight_wait_complete().

Hi, I ran into a similar problem recently and have done some analysis.

[ 6526.031343] INFO: task bash:846259 blocked for more than 606 seconds.
[ 6526.032060]       Not tainted 6.6.0-gfbf24d352c28-dirty #22
[ 6526.032635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6526.033404] task:bash            state:D stack:0     pid:846259 ppid:838395 flags:0x0000020d
[ 6526.034226] Call trace:
[ 6526.034527]  __switch_to+0x218/0x3e0
[ 6526.034925]  __schedule+0x734/0x11a8
[ 6526.035323]  schedule+0xa8/0x200
[ 6526.035731]  nfsd4_shutdown_callback+0x24c/0x2f0
[ 6526.036228]  __destroy_client+0x414/0x680
[ 6526.036663]  nfs4_state_destroy_net+0x144/0x448
[ 6526.037152]  nfs4_state_shutdown_net+0x2c8/0x450
[ 6526.037640]  nfsd_shutdown_net+0x100/0x2e0
[ 6526.038078]  nfsd_last_thread+0x190/0x330
[ 6526.038518]  nfsd_svc+0x3cc/0x4a0
[ 6526.038892]  write_threads+0x15c/0x2f0
[ 6526.039301]  nfsctl_transaction_write+0x90/0xd0
[ 6526.039836]  vfs_write+0x110/0x688
[ 6526.040221]  ksys_write+0xd0/0x188
[ 6526.040607]  __arm64_sys_write+0x4c/0x68
[ 6526.041035]  invoke_syscall+0x68/0x198
[ 6526.041455]  el0_svc_common.constprop.0+0x11c/0x150
[ 6526.041967]  do_el0_svc+0x38/0x50
[ 6526.042353]  el0_svc+0x5c/0x240
[ 6526.042723]  el0t_64_sync_handler+0x100/0x130
[ 6526.043186]  el0t_64_sync+0x188/0x190
[ 6526.051007] INFO: task cat:846265 blocked for more than 606 seconds.

1) Check cl_cb_inflight
crash> nfs4_client.cl_cb_inflight ffff000012338f08
  cl_cb_inflight = {
    counter = 1
  },
crash>

2) No work items associated with nfsd
The callback_wq worklist contains only two work items, neither of them related to NFSD.
crash> p callback_wq
callback_wq = $1 = (struct workqueue_struct *) 0xffff0000c30a1400
crash>
crash> workqueue_struct.cpu_pwq 0xffff0000c30a1400
  cpu_pwq = 0xccfe9cb5d8d0
crash> kmem -o
PER-CPU OFFSET VALUES:
  CPU 0: ffff2f015341c000
  CPU 1: ffff2f0153442000
  CPU 2: ffff2f0153468000
  CPU 3: ffff2f015348e000
crash>
// ffff2f015341c000 + ccfe9cb5d8d0  = FFFFFBFFEFF798D0
crash> rd FFFFFBFFEFF798D0
fffffbffeff798d0:  ffff0000d3488d00                    ..H.....
crash>
// ffff2f0153442000 + ccfe9cb5d8d0 = FFFFFBFFEFF9F8D0
crash> rd FFFFFBFFEFF9F8D0
fffffbffeff9f8d0:  ffff0000d3488d00                    ..H.....
crash>
// ffff2f0153468000 + ccfe9cb5d8d0 = FFFFFBFFEFFC58D0
crash> rd FFFFFBFFEFFC58D0
fffffbffeffc58d0:  ffff0000d3488d00                    ..H.....
crash>
// ffff2f015348e000 + ccfe9cb5d8d0 = FFFFFBFFEFFEB8D0
crash> rd FFFFFBFFEFFEB8D0
fffffbffeffeb8d0:  ffff0000d3488d00                    ..H.....
crash>
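For reference, the per-CPU addresses read above are just the per-CPU base for each CPU plus the cpu_pwq offset; a quick check of the arithmetic (addresses copied from the crash output above):

```python
# Each CPU's pool_workqueue slot lives at: per-CPU base + cpu_pwq offset.
# Values are taken verbatim from the "kmem -o" and cpu_pwq output above.
cpu_pwq_offset = 0xccfe9cb5d8d0
percpu_bases = {
    0: 0xffff2f015341c000,
    1: 0xffff2f0153442000,
    2: 0xffff2f0153468000,
    3: 0xffff2f015348e000,
}
for cpu, base in percpu_bases.items():
    print(f"CPU {cpu}: {base + cpu_pwq_offset:#x}")
```

All four sums land on the addresses passed to `rd` above, and all four slots point at the same pool_workqueue (ffff0000d3488d00).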
crash> pool_workqueue.pool ffff0000d3488d00
  pool = 0xffff0000c01b6800,
crash>
crash> worker_pool.worklist 0xffff0000c01b6800
  worklist = {
    next = 0xffff0000c906c4a8,
    prev = 0xffffd0ff8944fc68 <stats_flush_dwork+8>
  },
crash>
crash> list 0xffff0000c906c4a8
ffff0000c906c4a8
ffffd0ff8944fc68
ffff0000c01b6860
crash>
crash> work_struct.func ffff0000c906c4a0
  func = 0xffffd0ff84fae128 <wb_update_bandwidth_workfn>,
crash> work_struct.func 0xffffd0ff8944fc60
  func = 0xffffd0ff8510b258 <flush_memcg_stats_dwork>,
crash>
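The list entries above link through work_struct.entry. Assuming the usual v6.6 layout (atomic_long_t data at offset 0, struct list_head entry at offset 8), each containing work_struct is the entry address minus 8, which is how the two work addresses above were obtained:

```python
# container_of(entry, struct work_struct, entry) by hand:
# work_struct.entry sits 8 bytes past the start of work_struct
# (layout assumption for a 64-bit v6.6 kernel).
ENTRY_OFFSET = 8
for entry in (0xffff0000c906c4a8, 0xffffd0ff8944fc68):
    print(f"list entry {entry:#x} -> work_struct at {entry - ENTRY_OFFSET:#x}")
```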

3) No running kworkers
I checked the vmcore with "foreach bt"; all kworkers are idle, as follows.
PID: 62       TASK: ffff0000c31d0040  CPU: 1    COMMAND: "kworker/R-nfsio"
 #0 [ffff800080c27b80] __switch_to at ffffd0ff866297dc
 #1 [ffff800080c27bd0] __schedule at ffffd0ff8662a180
 #2 [ffff800080c27d00] schedule at ffffd0ff8662ac9c
 #3 [ffff800080c27d40] rescuer_thread at ffffd0ff84b418e4
 #4 [ffff800080c27e60] kthread at ffffd0ff84b52e14
PID: 94       TASK: ffff0000c74ba080  CPU: 0    COMMAND: "kworker/0:1H"
 #0 [ffff800080e07c00] __switch_to at ffffd0ff866297dc
 #1 [ffff800080e07c50] __schedule at ffffd0ff8662a180
 #2 [ffff800080e07d80] schedule at ffffd0ff8662ac9c
 #3 [ffff800080e07dc0] worker_thread at ffffd0ff84b40f94
 #4 [ffff800080e07e60] kthread at ffffd0ff84b52e14

4) Check work items related to nfsd4_run_cb_work
crash> p nfsd4_run_cb_work
nfsd4_run_cb_work = $5 =
 {void (struct work_struct *)} 0xffffd0ff855691e0 <nfsd4_run_cb_work>
crash> search ffffd0ff855691e0
ffff000010474138: ffffd0ff855691e0
ffff0000104750f8: ffffd0ff855691e0
ffff0000104752f0: ffffd0ff855691e0
ffff0000104756e0: ffffd0ff855691e0
ffff000012338388: ffffd0ff855691e0
ffff000012339288: ffffd0ff855691e0
ffff00001233a908: ffffd0ff855691e0
ffff00001233b808: ffffd0ff855691e0
ffff0000c745d038: ffffd0ff855691e0
ffff0000c86499f8: ffffd0ff855691e0
ffff0000c8649b30: ffffd0ff855691e0
ffff0000c9ff8dc8: ffffd0ff855691e0
crash>
ffff000010474138 --> (work) ffff000010474120
ffff0000104750f8 --> (work) ffff0000104750e0
ffff0000104752f0 --> (work) ffff0000104752d8
ffff0000104756e0 --> (work) ffff0000104756c8
ffff000012338388 --> (work) ffff000012338370
ffff000012339288 --> (work) ffff000012339270
ffff00001233a908 --> (work) ffff00001233a8f0
ffff00001233b808 --> (work) ffff00001233b7f0
ffff0000c745d038 --> (work) ffff0000c745d020
ffff0000c86499f8 --> (work) ffff0000c86499e0
ffff0000c8649b30 --> (work) ffff0000c8649b18
ffff0000c9ff8dc8 --> (work) ffff0000c9ff8db0
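The search hits are the addresses of the func field inside each work_struct; with func at offset 0x18 (8-byte data plus 16-byte entry, a layout assumption for a 64-bit v6.6 kernel), the containing work_struct is the hit address minus 0x18, which is how the "(work)" column above was derived:

```python
# Map "search" hits on the func pointer back to their work_structs:
# offsetof(struct work_struct, func) = 8 (data) + 16 (entry) = 0x18
# (layout assumption; no lockdep fields before func).
FUNC_OFFSET = 0x18
for hit in (0xffff000010474138, 0xffff0000104750f8, 0xffff0000c9ff8dc8):
    print(f"hit {hit:#x} -> work_struct at {hit - FUNC_OFFSET:#x}")
```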
crash> work_struct.data ffff0000104750e0
  data = {
    counter = 68719476704 // 0xFFFFFFFE0: bits 0-4 are 0, so not pending
  },
crash>
crash> work_struct.data ffff0000c9ff8db0
  data = {
    counter = 256 // 0x100
  },
crash>
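The low bits of work_struct.data are WORK_STRUCT flag bits; in particular WORK_STRUCT_PENDING_BIT is bit 0, so neither of the two data values above is marked pending (a rough decode, assuming the v6.6 flag layout):

```python
# Decode the low flag bits of work_struct.data. Only the pending bit
# (bit 0, WORK_STRUCT_PENDING_BIT) is checked here; the exact layout of
# the remaining flag bits is kernel-version dependent.
WORK_STRUCT_PENDING = 1 << 0
for work, data in [("ffff0000104750e0", 68719476704),
                   ("ffff0000c9ff8db0", 256)]:
    pending = bool(data & WORK_STRUCT_PENDING)
    print(f"work {work}: data={data:#x} pending={pending}")
```

Neither work item is pending, consistent with the empty nfsd-related worklist found in step 2.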

I have added some debug instrumentation and am trying to reproduce the hang.
Could you share any additional information you have collected,
or any suggestions on how to proceed?

Thanks.
> 
> In some cases, clients suspend (inactivity). The server converts them to
> courteous clients. The NFSv4 callback shutdown workqueue item for that
> client appears to be stuck waiting in rpc_shutdown_client().
> 
> Let's collect data under this bug report.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c17
You can reply to this message to join the discussion.
-- 
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)



Thread overview: 35+ messages
2025-01-20 15:00 NFSD threads hang when destroying a session or client ID Chuck Lever via Bugspray Bot
2025-01-20 15:14 ` Chuck Lever
2025-01-20 15:25 ` Chuck Lever via Bugspray Bot
2025-01-20 15:40 ` Chuck Lever via Bugspray Bot
2025-01-20 19:00 ` Chuck Lever via Bugspray Bot
2025-01-20 20:35 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-21 14:40 ` Jeff Layton via Bugspray Bot
2025-01-21 16:10 ` Chuck Lever via Bugspray Bot
2025-01-21 17:35   ` Jeff Layton via Bugspray Bot
2025-01-21 19:38     ` Tom Talpey
2025-01-21 19:43       ` Chuck Lever
2025-01-21 16:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-21 16:35   ` Chuck Lever via Bugspray Bot
2025-01-22 11:40     ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-22 14:19       ` Chuck Lever
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
2025-01-23  2:10 ` Li Lingfeng via Bugspray Bot [this message]
2025-01-23 13:50 ` Jeff Layton via Bugspray Bot
2025-01-23 14:22   ` Chuck Lever
2025-01-23 20:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-23 21:45 ` Chuck Lever via Bugspray Bot
2025-01-26  9:25 ` Baptiste PELLEGRIN via Bugspray Bot
2025-01-26 17:05   ` Chuck Lever via Bugspray Bot
2025-01-29 13:15 ` rik.theys via Bugspray Bot
2025-01-29 19:40 ` Chuck Lever via Bugspray Bot
2025-01-30 14:05   ` rik.theys via Bugspray Bot
2025-01-29 19:50 ` Chuck Lever via Bugspray Bot
2025-02-10 12:05 ` Baptiste PELLEGRIN via Bugspray Bot
2025-02-21 13:42   ` Salvatore Bonaccorso
2025-02-21 13:57     ` Harald Dunkel
2025-02-21 14:31       ` Salvatore Bonaccorso
2025-02-21 14:50       ` Jeff Layton via Bugspray Bot
2025-02-21 16:00     ` Chuck Lever via Bugspray Bot
2025-02-21 14:45 ` Jeff Layton via Bugspray Bot
