Linux NFS development
* [bug report] deploying both NFS client and server on the same machine triggers hung task
@ 2024-11-25 11:17 Li Lingfeng
  2024-11-25 17:32 ` Mark Liam Brown
  2024-11-28  7:22 ` Li Lingfeng
  0 siblings, 2 replies; 6+ messages in thread
From: Li Lingfeng @ 2024-11-25 11:17 UTC (permalink / raw)
  To: Dai.Ngo, Chuck Lever, Jeff Layton, NeilBrown, okorniev, tom,
	trond.myklebust
  Cc: linux-nfs, linux-kernel, Yu Kuai, Hou Tao, zhangyi (F), yangerkun,
	chengzhihao1, Li Lingfeng, Li Lingfeng

Hi, we recently found a hung task issue.

Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
memory condition") adds a shrinker to NFSD, which causes NFSD to try to
obtain shrinker_rwsem when starting and stopping services.

Deploying both NFS client and server on the same machine may lead to the
following issue, since they will share the global shrinker_rwsem.

     nfsd                            nfs
                             drop_cache // hold shrinker_rwsem
                             write back, wait for rpc_task to exit
// stop nfsd threads
svc_set_num_threads
// clean up xprts
svc_xprt_destroy_all
                             rpc_check_timeout
                              rpc_check_connected
                              // wait for the connection to be disconnected
unregister_shrinker
// wait for shrinker_rwsem

Normally, the client's rpc_task exits once one of the server's nfsd threads
has processed the request.
When all of the server's nfsd threads have exited, the client's rpc_task is
expected to detect that the network connection has been dropped and exit.
However, although the server has executed svc_xprt_destroy_all before
waiting for shrinker_rwsem, the network connection is not actually
disconnected. Instead, the operation to close the socket is simply added
to the task_works queue.

svc_xprt_destroy_all
  ...
  svc_sock_free
   sockfd_put
    fput_many
     init_task_work // ____fput
     task_work_add // add to task->task_works
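
As a rough user-space illustration of the deferral shown in the trace above
(a toy model, not kernel code): the class and function names below are made
up for this sketch, and only the kernel function names mentioned in the
comments come from the trace. It assumes, as described in this report, that
the final put merely queues the close and that the queue is drained later.

```python
class ToyTask:
    """Toy stand-in for a task that carries a list of deferred callbacks."""

    def __init__(self):
        self.task_works = []  # callbacks deferred until the task runs them

    def task_work_add(self, callback):
        # Queue the callback; nothing is executed at this point.
        self.task_works.append(callback)

    def task_work_run(self):
        # Drain the queue in FIFO order, as would happen when the task exits.
        while self.task_works:
            self.task_works.pop(0)()


class ToySocket:
    """Toy socket whose connection stays up until close() is called."""

    def __init__(self):
        self.connected = True

    def close(self):
        self.connected = False


def toy_fput(task, sock):
    # Mirrors the trace: the final put only queues the close (____fput).
    task.task_work_add(sock.close)


nfsd = ToyTask()
sock = ToySocket()

toy_fput(nfsd, sock)
still_connected_after_put = sock.connected  # close is queued, not yet run

nfsd.task_work_run()
connected_after_run = sock.connected        # queued close has now run
```

In this model the socket is still connected after toy_fput() returns and is
only closed once task_work_run() is called, which is the window in which the
client keeps waiting.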

The network connection is only actually closed when the current process
exits and runs its pending task works:

do_exit
  exit_task_work
   task_work_run
   ...
    ____fput // close sock
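
Putting the pieces of this report together, the hang can be viewed as a cycle
in a wait-for graph. The sketch below is only a toy model: the node labels are
informal names for the steps described in this report, and the edges encode
my reading of the scenario rather than anything extracted from the kernel.

```python
# Each key waits for its value. Labels are informal, chosen for this sketch.
waits_for = {
    "nfsd: unregister_shrinker": "shrinker_rwsem released",
    "shrinker_rwsem released": "nfs: drop_cache writeback done",
    "nfs: drop_cache writeback done": "rpc_task exits",
    "rpc_task exits": "socket closed (____fput)",
    "socket closed (____fput)": "nfsd: task_work_run",
    "nfsd: task_work_run": "nfsd: unregister_shrinker",
}


def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle if one is reached."""
    path = []
    node = start
    while node in graph:
        if node in path:
            # We came back to a node already on the path: that suffix is a cycle.
            return path[path.index(node):]
        path.append(node)
        node = graph[node]
    return None  # the chain ended without looping back


cycle = find_cycle(waits_for, "nfsd: unregister_shrinker")
if cycle:
    print(" -> ".join(cycle + [cycle[0]]))
```

Every step in the chain waits on the next one, and the last step waits on the
first, so none of them can make progress unless one edge is broken.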

Although deploying an NFS client and server on the same machine is not
common practice, I think this issue still needs to be addressed. Otherwise,
every process that tries to acquire shrinker_rwsem will hang.

I don't have any ideas yet on how to solve this problem. Does anyone have
any suggestions?

Thanks.



Thread overview: 6+ messages
2024-11-25 11:17 [bug report] deploying both NFS client and server on the same machine triggers hung task Li Lingfeng
2024-11-25 17:32 ` Mark Liam Brown
2024-11-26  2:28   ` Li Lingfeng
2024-11-28  7:22 ` Li Lingfeng
2024-12-02 16:05   ` Chuck Lever III
2024-12-03  2:32     ` Li Lingfeng
