Linux NFS development
From: Chuck Lever III <chuck.lever@oracle.com>
To: Li Lingfeng <lilingfeng3@huawei.com>
Cc: Dai Ngo <dai.ngo@oracle.com>, Jeff Layton <jlayton@kernel.org>,
	Neil Brown <neilb@suse.de>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Tom Talpey <tom@talpey.com>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Yu Kuai <yukuai1@huaweicloud.com>, Hou Tao <houtao1@huawei.com>,
	"zhangyi (F)" <yi.zhang@huawei.com>,
	yangerkun <yangerkun@huawei.com>,
	"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
	Li Lingfeng <lilingfeng@huaweicloud.com>
Subject: Re: [bug report] deploying both NFS client and server on the same machine triggle hungtask
Date: Mon, 2 Dec 2024 16:05:18 +0000	[thread overview]
Message-ID: <D4E120A4-D877-48CC-AE40-D55DBB6265D0@oracle.com> (raw)
In-Reply-To: <8b155d3c-62b4-4f16-ab00-e3d030148d29@huawei.com>



> On Nov 28, 2024, at 2:22 AM, Li Lingfeng <lilingfeng3@huawei.com> wrote:
> 
> Besides nfsd_file_shrinker, the nfsd_client_shrinker added by commit
> 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low memory
> condition") in 2022 and the nfsd_reply_cache_shrinker added by commit
> 3ba75830ce17 ("nfsd4: drc containerization") in 2019 may also trigger such
> an issue.
> Was this scenario not considered when designing the shrinkers for NFSD, or
> was it deemed unreasonable and not worth considering?

I'm speculating, but it is possible that the issue was
introduced by another patch in an area related to the
rwsem. Seems like there is a testing gap in this area.

Can you file a bugzilla report on bugzilla.kernel.org
under Filesystems/NFSD?


> On 2024/11/25 19:17, Li Lingfeng wrote:
>> Hi, we recently found a hung-task issue.
>> 
>> Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
>> memory condition") adds a shrinker to NFSD, which causes NFSD to try to
>> obtain shrinker_rwsem when starting and stopping services.
>> 
>> Deploying both NFS client and server on the same machine may lead to the
>> following issue, since they will share the global shrinker_rwsem.
>> 
>>     nfsd                            nfs
>>                             drop_cache // hold shrinker_rwsem
>>                             write back, wait for rpc_task to exit
>> // stop nfsd threads
>> svc_set_num_threads
>> // clean up xprts
>> svc_xprt_destroy_all
>>                             rpc_check_timeout
>>                              rpc_check_connected
>>                              // wait for the connection to be disconnected
>> unregister_shrinker
>> // wait for shrinker_rwsem
>> 
>> Normally, the client's rpc_task will exit after the server's nfsd thread
>> has processed the request.
>> When all the server's nfsd threads exit, the client’s rpc_task is expected
>> to detect the network connection being disconnected and exit.
>> However, although the server has executed svc_xprt_destroy_all before
>> waiting for shrinker_rwsem, the network connection is not actually
>> disconnected. Instead, the operation to close the socket is simply added
>> to the task_works queue.
>> 
>> svc_xprt_destroy_all
>>  ...
>>  svc_sock_free
>>   sockfd_put
>>    fput_many
>>     init_task_work // ____fput
>>     task_work_add // add to task->task_works
>> 
>> The actual disconnection of the network connection will only occur after
>> the current process finishes.
>> do_exit
>>  exit_task_work
>>   task_work_run
>>   ...
>>    ____fput // close sock
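The deferral described by the two call traces above can be sketched as a userspace model (hypothetical names mirroring task_work_add()/task_work_run(); this is an illustration of the mechanism, not the kernel implementation): the close is queued at sockfd_put() time but only executed when the task unwinds.

```c
#include <stddef.h>

/* Hypothetical model of the per-task work list that fput() uses to
 * defer the real close (____fput) until the task exits. */
struct task_work {
	void (*func)(void *arg);
	void *arg;
	struct task_work *next;
};

struct task_work *task_works;	/* models task->task_works */
int sock_open = 1;		/* 1 while the socket is still connected */

/* Like task_work_add(): queue the callback, do not run it now. */
void task_work_add(struct task_work *tw)
{
	tw->next = task_works;
	task_works = tw;
}

/* Like ____fput(): this is where the socket is actually closed. */
void fput_work(void *arg)
{
	*(int *)arg = 0;
}

/* Like svc_sock_free() -> sockfd_put(): only queues the close.
 * On return, sock_open is still 1 -- nothing was torn down yet. */
void sockfd_put(struct task_work *tw)
{
	tw->func = fput_work;
	tw->arg = &sock_open;
	task_work_add(tw);
}

/* Like exit_task_work() -> task_work_run() in do_exit(): the queued
 * closes finally execute, long after svc_xprt_destroy_all() returned. */
void task_work_run(void)
{
	for (struct task_work *tw = task_works; tw; tw = tw->next)
		tw->func(tw->arg);
	task_works = NULL;
}
```

The gap between sockfd_put() returning and task_work_run() executing is the window in which the client still sees a live connection while the server is already blocked on shrinker_rwsem.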
>> 
>> Although deploying the NFS client and server on the same machine is not
>> common practice, I think this issue still needs to be addressed;
>> otherwise it hangs every process that tries to acquire the
>> shrinker_rwsem.
>> 
>> I don't have any ideas yet on how to solve this problem, does anyone have
>> any suggestions?
>> 
>> Thanks.
>> 

--
Chuck Lever




Thread overview: 6+ messages
2024-11-25 11:17 [bug report] deploying both NFS client and server on the same machine triggle hungtask Li Lingfeng
2024-11-25 17:32 ` Mark Liam Brown
2024-11-26  2:28   ` Li Lingfeng
2024-11-28  7:22 ` Li Lingfeng
2024-12-02 16:05   ` Chuck Lever III [this message]
2024-12-03  2:32     ` Li Lingfeng
