Linux NFS development
From: Chuck Lever <chuck.lever@oracle.com>
To: NeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Jeff Layton <jlayton@kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work.
Date: Tue, 28 Nov 2023 10:34:39 -0500
Message-ID: <ZWYIj7K0KPQFCCdf@tissot.1015granger.net>
In-Reply-To: <170114025065.7109.15330780753462853254@noble.neil.brown.name>

On Tue, Nov 28, 2023 at 01:57:30PM +1100, NeilBrown wrote:
> 
> (trimmed cc...)
> 
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > 
> > > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > > > work-queue thread running delayed_fput().  As you might imagine this
> > > > > does not end well (20 million files in the queue at the time a snapshot
> > > > > was taken for analysis).
> > > > > 
> > > > > While this might point to a problem with the filesystem not handling the
> > > > > final close efficiently, such problems should only hurt throughput, not
> > > > > lead to memory exhaustion.
> > > > 
> > > > I have this patch queued for v6.8:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > 
> > > 
> > > Thanks....
> > > I think that change is good, but I don't think it addresses the problem
> > > mentioned in the description, and it is not directly relevant to the
> > > problem I saw ... though it is complicated.
> > > 
> > > The problem "workqueue ...  hogged cpu..." probably means that
> > > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > > That will stop it from hogging the CPU whether it is tied to one CPU or
> > > free to roam.
> > > 
> > > Also that work is calling filp_close() which primarily calls
> > > filp_flush().
> > > It also calls fput() but that does minimal work.  If there is much work
> > > to do then that is offloaded to another work-item.  *That* is the
> > > workitem that I had problems with.
> > > 
> > > The problem I saw was with an older kernel which didn't have the nfsd
> > > file cache and so probably is calling filp_close more often.
> > 
> > Without the file cache, the filp_close() should be handled directly
> > by the nfsd thread handling the RPC, IIRC.
> 
> Yes - but __fput() is handled by a workqueue.
> 
> > 
> > 
> > > So maybe
> > > my patch isn't so important now.  Particularly as nfsd now isn't closing
> > > most files in-task but instead offloads that to another task.  So the
> > > final fput will not be handled by the nfsd task either.
> > > 
> > > But I think there is room for improvement.  Gathering lots of files
> > > together into a list and closing them sequentially is not going to be as
> > > efficient as closing them in parallel.
> > 
> > I believe the file cache passes the filps to the work queue one at
> 
> nfsd_file_close_inode() does.  nfsd_file_gc() and nfsd_file_lru_scan()
> can pass multiple.
> 
> > a time, but I don't think there's anything that forces the work
> > queue to handle each flush/close completely before proceeding to the
> > next.
> 
> Parallelism with workqueues is controlled by the work items (struct
> work_struct).  Two different work items can run in parallel.  But any
> given work item can never run parallel to itself.
> 
> The only work items queued on nfsd_filecache_wq are from
>   nn->fcache_disposal->work.
> There is one of these for each network namespace.  So in any given
> network namespace, all work on nfsd_filecache_wq is fully serialised.

OIC, it's that specific case you are concerned with. The per-
namespace laundrette was added by:

  9542e6a643fc ("nfsd: Containerise filecache laundrette")

Its purpose was to confine the close backlog to each container.

Seems like it would be better if there were a struct work_struct
in each struct nfsd_file. That wouldn't add real backpressure to
nfsd threads, but it would enable file closes to run in parallel.
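
To make the difference concrete, here is a rough userspace model. It
is not kernel code and uses no workqueue APIs; close_makespan() and
MODEL_MAX_EXECUTORS are names made up for illustration. It assumes
each close has a fixed cost and that closes are handed out in order
to whichever execution context becomes idle first. One context stands
in for the single per-namespace work item; several contexts stand in
for one work item per file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap on modelled execution contexts (illustration only). */
#define MODEL_MAX_EXECUTORS 64

/*
 * Userspace model only. Returns the time at which the last close
 * finishes when n closes, with per-close costs in cost[], are handed
 * out in order, each to the context that becomes idle soonest.
 *
 * executors == 1: one work item per namespace, closes are serialised.
 * executors  > 1: one work item per file, bounded by available CPUs.
 */
static unsigned long close_makespan(const unsigned long *cost, size_t n,
				    size_t executors)
{
	unsigned long busy_until[MODEL_MAX_EXECUTORS] = { 0 };
	unsigned long makespan = 0;
	size_t i, j;

	assert(executors > 0 && executors <= MODEL_MAX_EXECUTORS);

	for (i = 0; i < n; i++) {
		size_t earliest = 0;

		/* Pick the context that becomes idle soonest. */
		for (j = 1; j < executors; j++)
			if (busy_until[j] < busy_until[earliest])
				earliest = j;

		busy_until[earliest] += cost[i];
		if (busy_until[earliest] > makespan)
			makespan = busy_until[earliest];
	}
	return makespan;
}
```

In this model, closes costing { 5, 1, 1, 1, 4 } take 12 units with one
context and 5 units with four. It says nothing about the queueing
rate, so it does not show any backpressure on the nfsd threads.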


> > IOW there is some parallelism there already, especially now that
> > nfsd_filecache_wq is UNBOUND.
> 
> No there is not.  And UNBOUND makes no difference to parallelism in this
> case.  It allows the one work item to migrate between CPUs while it is
> running, but it doesn't allow it to run concurrently on two different
> CPUs.

Right. The laundrette can now run in parallel with other work by
moving to a different core, but there still can be only one
laundrette running per namespace.


> (UNBOUND can improve parallelism when multiple different work items are
>  submitted all from the same CPU.  Without UNBOUND all the work would
>  happen on the same CPU, though if the work sleeps, the different work
>  items can be interleaved.  With UNBOUND the different work items can
>  enjoy true parallelism when needed).
> 
> 
> > 
> > 
> > > > > For normal threads, the thread that closes the file also calls the
> > > > > final fput so there is natural rate limiting preventing excessive growth
> > > > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > > > nfsd, delays in the final fput do not impose any throttling to prevent
> > > > > the thread from closing more files.
> > > > 
> > > > I don't think we want to block nfsd threads waiting for files to
> > > > close. Won't that be a potential denial of service?
> > > 
> > > Not as much as the denial of service caused by memory exhaustion due to
> > > an indefinitely growing list of files waiting to be closed by a single
> > > thread of workqueue.
> > 
> > The cache garbage collector is single-threaded, but nfsd_filecache_wq
> > has a max_active setting of zero.
> 
> This allows parallelism between network namespaces, but not within a
> network namespace.
> 
> > 
> > 
> > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > > the nfsd thread should completely handle that request including all the
> > > flush and ->release etc.  If that causes any denial of service, then
> > > simply increase the number of nfsd threads.
> > > 
> > > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > > filp_close happens after each READ or WRITE (though I think the customer
> > > was using NFSv4...).  With the file cache there is no thread that is
> > > obviously responsible for the close.
> > > To get the sort of throttling that I think is needed, we could possibly
> > > have each "nfsd_open" check if there are pending closes, and to wait for
> > > some small amount of progress.
> > 
> > Well nfsd_open() in particular appears to be used only for readdir.
> > 
> > But maybe nfsd_file_acquire() could wait briefly, in the garbage-
> > collected case, if the nfsd_net's disposal queue is long.
> > 
> > 
> > > But I don't think it is reasonable for the nfsd threads to take none of
> > > the burden of closing files as that can result in imbalance.
> > > 
> > > I'll need to give this more thought.
> > 
> > 
> > -- 
> > Chuck Lever
> > 
> 
> Thanks,
> NeilBrown

-- 
Chuck Lever
