From: Chuck Lever III <chuck.lever@oracle.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
Neil Brown <neilb@suse.de>
Subject: Re: [PATCH v2 3/3] nfsd: start non-blocking writeback after adding nfsd_file to the LRU
Date: Fri, 28 Oct 2022 17:21:27 +0000 [thread overview]
Message-ID: <A040CDCA-5E3F-470F-8D69-8FF9DA4325FE@oracle.com> (raw)
In-Reply-To: <cc4bfa448efedd0017fc7b20b8b7475907acbc5e.camel@kernel.org>
> On Oct 28, 2022, at 11:51 AM, Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2022-10-28 at 15:29 +0000, Chuck Lever III wrote:
>>
>>> On Oct 28, 2022, at 11:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>>
>>> On Fri, 2022-10-28 at 13:16 +0000, Chuck Lever III wrote:
>>>>
>>>>> On Oct 27, 2022, at 5:52 PM, Jeff Layton <jlayton@kernel.org> wrote:
>>>>>
>>>>> When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
>>>>> so that we can be ready to close it out when the time comes.
>>>>
>>>> For a large file, a background flush still has to walk the file's
>>>> pages to see if they are dirty, and that consumes time, CPU, and
>>>> memory bandwidth. We're talking hundreds of microseconds for a
>>>> large file.
>>>>
>>>> Then the final flush does all that again.
>>>>
>>>> Basically, two (or more!) passes through the file for exactly the
>>>> same amount of work. Is there any measured improvement in latency
>>>> or throughput?
>>>>
>>>> And then... for a GC file, no-one is waiting on data persistence
>>>> during nfsd_file_put() so I'm not sure what is gained by taking
>>>> control of the flushing process away from the underlying filesystem.
>>>>
>>>>
>>>> Remind me why the filecache is flushing? Shouldn't NFSD rely on
>>>> COMMIT operations for that? (It's not obvious reading the code,
>>>> maybe there should be a documenting comment somewhere that
>>>> explains this arrangement).
>>>>
>>>
>>>
>>> Fair point. I was trying to replicate the behaviors introduced in these
>>> patches:
>>>
>>> b6669305d35a nfsd: Reduce the number of calls to nfsd_file_gc()
>>> 6b8a94332ee4 nfsd: Fix a write performance regression
>>>
>>> AFAICT, the fsync is there to catch writeback errors so that we can
>>> reset the write verifiers (AFAICT). The rationale for that is described
>>> here:
>>>
>>> 055b24a8f230 nfsd: Don't garbage collect files that might contain write errors
>>
>> Yes, I've been confused about this since then :-)
>>
>> So, the patch description says:
>>
>> If a file may contain unstable writes that can error out, then we want
>> to avoid garbage collecting the struct nfsd_file that may be
>> tracking those errors.
>>
>> That doesn't explain why that's a problem, it just says what we plan to
>> do about it.
>>
>>
>>> The problem with not calling vfs_fsync is that we might miss writeback
>>> errors. The nfsd_file could get reaped before a v3 COMMIT ever comes in.
>>> nfsd would eventually reopen the file but it could miss seeing the error
>>> if it got opened locally in the interim.
>>
>> That helps. So we're surfacing writeback errors for local writers?
>>
>
> Well for non-v3 writers anyway. I suppose you could hit the same
> scenario with a mixed v3 and v4 workload if you were unlucky enough, or
> mixed v3 and ksmbd workload, etc...
>
>> I guess I would like this flushing to interfere as little as possible
>> with the server's happy zone, since it's not something clients need to
>> wait for, and an error is exceptionally rare.
>>
>> But also, we can't let writeback errors hold onto a bunch of memory
>> indefinitely. How much nfsd_file and page cache memory might be be
>> pinned by a writeback error, and for how long?
>>
>
> You mean if we were to stop trying to fsync out when closing? We don't
> keep files in the cache indefinitely, even if they have writeback
> errors.
>
> In general, the kernel attempts to write things out, and if it fails it
> sets a writeback error in the mapping and marks the pages clean. So if
> we're talking about files that are no longer being used (since they're
> being GC'ed), we only block reaping them for as long as writeback is in
> progress.
>
> Once writeback ends and it's eligible for reaping, we'll call vfs_fsync
> a final time, grab the error and reset the write verifier when it's
> non-zero.
>
> If we stop doing fsyncs, then that model sort of breaks down. I'm not
> clear on what you'd like to do instead.
I'm not clear either. I think I just have some hand-wavy requirements.
I think keeping the flushes in the nfsd threads and away from single-
threaded garbage collection makes sense. Keep I/O in nfsd context, not
in the filecache garbage collector. I'm not sure that's guaranteed
if the garbage collection thread does an nfsd_file_put() that flushes.
And, we need to ensure that an nfsd_file isn't pinned forever -- the
flush has to make forward progress so that the nfsd_file is eventually
released. Otherwise, writeback errors become a DoS vector.
But, back to the topic of this patch: my own experiments with background
syncing showed that it introduces significant overhead and it wasn't
really worth the trouble. Basically, on intensive workloads, the garbage
collector must not stall or live-lock because it's walking through
millions of pages trying to figure out which ones are dirty.
>>> I'm not sure we need to worry about that so much for v4 though. Maybe we
>>> should just do this for GC files?
>>
>> I'm not caffeinated yet. Why is it not a problem for v4? Is it because
>> an open or delegation stateid will prevent the nfsd_file from going
>> away?
>>
>
>
> Yeah, more or less.
>
> I think that for a error to be lost with v4, it would require the client
> to have an application access pattern that would expose it to that
> possibility on a local filesystem as well. I don't think we have any
> obligation to do more there.
>
> Maybe that's a false assumption though.
>
>> Sorry for the noise. It's all a little subtle.
>>
>
> Very subtle. The more we can get this fleshed out into comments the
> better, so I welcome the questions.
>
>>
>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>> ---
>>>>> fs/nfsd/filecache.c | 37 +++++++++++++++++++++++++++++++------
>>>>> 1 file changed, 31 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>>>> index d2bbded805d4..491d3d9a1870 100644
>>>>> --- a/fs/nfsd/filecache.c
>>>>> +++ b/fs/nfsd/filecache.c
>>>>> @@ -316,7 +316,7 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>>>>> }
>>>>>
>>>>> static void
>>>>> -nfsd_file_flush(struct nfsd_file *nf)
>>>>> +nfsd_file_fsync(struct nfsd_file *nf)
>>>>> {
>>>>> struct file *file = nf->nf_file;
>>>>>
>>>>> @@ -327,6 +327,22 @@ nfsd_file_flush(struct nfsd_file *nf)
>>>>> nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>>>> }
>>>>>
>>>>> +static void
>>>>> +nfsd_file_flush(struct nfsd_file *nf)
>>>>> +{
>>>>> + struct file *file = nf->nf_file;
>>>>> + unsigned long nrpages;
>>>>> +
>>>>> + if (!file || !(file->f_mode & FMODE_WRITE))
>>>>> + return;
>>>>> +
>>>>> + nrpages = file->f_mapping->nrpages;
>>>>> + if (nrpages) {
>>>>> + this_cpu_add(nfsd_file_pages_flushed, nrpages);
>>>>> + filemap_flush(file->f_mapping);
>>>>> + }
>>>>> +}
>>>>> +
>>>>> static void
>>>>> nfsd_file_free(struct nfsd_file *nf)
>>>>> {
>>>>> @@ -337,7 +353,7 @@ nfsd_file_free(struct nfsd_file *nf)
>>>>> this_cpu_inc(nfsd_file_releases);
>>>>> this_cpu_add(nfsd_file_total_age, age);
>>>>>
>>>>> - nfsd_file_flush(nf);
>>>>> + nfsd_file_fsync(nf);
>>>>>
>>>>> if (nf->nf_mark)
>>>>> nfsd_file_mark_put(nf->nf_mark);
>>>>> @@ -500,12 +516,21 @@ nfsd_file_put(struct nfsd_file *nf)
>>>>>
>>>>> if (test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
>>>>> /*
>>>>> - * If this is the last reference (nf_ref == 1), then transfer
>>>>> - * it to the LRU. If the add to the LRU fails, just put it as
>>>>> - * usual.
>>>>> + * If this is the last reference (nf_ref == 1), then try
>>>>> + * to transfer it to the LRU.
>>>>> + */
>>>>> + if (refcount_dec_not_one(&nf->nf_ref))
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * If the add to the list succeeds, try to kick off SYNC_NONE
>>>>> + * writeback. If the add fails, then just fall through to
>>>>> + * decrement as usual.
>>>>
>>>> These comments simply repeat what the code does, so they seem
>>>> redundant to me. Could they instead explain why?
>>>>
>>>>
>>>>> */
>>>>> - if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
>>>>> + if (nfsd_file_lru_add(nf)) {
>>>>> + nfsd_file_flush(nf);
>>>>> return;
>>>>> + }
>>>>> }
>>>>> __nfsd_file_put(nf);
>>>>> }
>>>>> --
>>>>> 2.37.3
>>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>>
>>>>
>>>>
>>>
>>> --
>>> Jeff Layton <jlayton@kernel.org>
>>
>> --
>> Chuck Lever
>>
>>
>>
>
> --
> Jeff Layton <jlayton@kernel.org>
--
Chuck Lever
next prev parent reply other threads:[~2022-10-28 17:21 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-27 21:52 [PATCH v2 0/3] nfsd: clean up refcounting in filecache Jeff Layton
2022-10-27 21:52 ` [PATCH v2 1/3] nfsd: rework " Jeff Layton
2022-10-27 22:51 ` NeilBrown
2022-10-28 9:57 ` Jeff Layton
2022-10-28 22:14 ` NeilBrown
2022-10-30 21:29 ` NeilBrown
2022-10-31 9:58 ` Jeff Layton
2022-10-31 9:54 ` Jeff Layton
2022-10-27 21:52 ` [PATCH v2 2/3] nfsd: only keep unused entries on the LRU Jeff Layton
2022-10-27 22:20 ` NeilBrown
2022-10-27 22:47 ` Jeff Layton
2022-10-27 22:55 ` NeilBrown
2022-10-27 23:52 ` Chuck Lever III
2022-10-27 21:52 ` [PATCH v2 3/3] nfsd: start non-blocking writeback after adding nfsd_file to " Jeff Layton
2022-10-27 22:25 ` NeilBrown
2022-10-28 13:01 ` Jeff Layton
2022-10-28 13:16 ` Chuck Lever III
2022-10-28 15:05 ` Jeff Layton
2022-10-28 15:29 ` Chuck Lever III
2022-10-28 15:51 ` Jeff Layton
2022-10-28 17:21 ` Chuck Lever III [this message]
2022-10-28 17:43 ` Jeff Layton
2022-10-28 18:53 ` Chuck Lever III
2022-10-28 20:04 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=A040CDCA-5E3F-470F-8D69-8FF9DA4325FE@oracle.com \
--to=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox