public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed
From: "Chuck Lever" <cel@kernel.org>
To: NeilBrown <neil@brown.name>
Cc: "Jeff Layton" <jlayton@kernel.org>,
	"Olga Kornievskaia" <okorniev@redhat.com>,
	"Dai Ngo" <dai.ngo@oracle.com>, "Tom Talpey" <tom@talpey.com>,
	"Mike Snitzer" <snitzer@kernel.org>,
	"Christoph Hellwig" <hch@lst.de>,
	linux-nfs@vger.kernel.org, "Chuck Lever" <chuck.lever@oracle.com>
Subject: Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs
Date: Sat, 10 Jan 2026 18:33:59 -0500	[thread overview]
Message-ID: <eaac3112-eafc-4bc5-974f-752edf868788@app.fastmail.com> (raw)
In-Reply-To: <176808109160.2462021.5788018456330144196@noble.neil.brown.name>



On Sat, Jan 10, 2026, at 4:38 PM, NeilBrown wrote:
> On Sun, 11 Jan 2026, Chuck Lever wrote:
>> 
>> On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote:
>> > On Sat, 10 Jan 2026, Chuck Lever wrote:
>> >> From: Chuck Lever <chuck.lever@oracle.com>
>> >> 
>> >> When memory pressure occurs during buffered writes, the traditional
>> >> approach is for balance_dirty_pages() to put the writing thread to
>> >> sleep until dirty pages are flushed. For NFSD, this means server
>> >> threads block waiting for I/O, reducing overall server throughput.
>> >> 
>> >> Add asynchronous write throttling for UNSTABLE writes using the
>> >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD
>> >> checks memory pressure before attempting a buffered write. If the
>> >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns
>> >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead
>> >> of blocking.
>> >> 
>> >> Clients then wait and retry, rather than tying up server memory with
>> >> a cached uncommitted write payload.
>> >> 
>> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> >> ---
>> >>  fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++
>> >>  1 file changed, 24 insertions(+)
>> >> 
>> >> Compile tested only.
>> >> 
>> >> Changes since RFC v1:
>> >> - Remove the experimental debugfs setting
>> >> - Enforce throttling specifically only for UNSTABLE WRITEs
>> >> 
>> >> 
>> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> >> index 168d3ccc8155..c4550105234e 100644
>> >> --- a/fs/nfsd/vfs.c
>> >> +++ b/fs/nfsd/vfs.c
>> >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> >>  		}
>> >>  	}
>> >>  
>> >> +	/*
>> >> +	 * Throttle buffered writes under memory pressure. When dirty
>> >> +	 * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be
>> >> +	 * returned rather than blocking the thread. This -EAGAIN
>> >> +	 * maps to nfserr_jukebox, signaling the client to back off
>> >> +	 * and retry rather than tying up a server thread during
>> >> +	 * writeback.
>> >> +	 *
>> >> +	 * NFSv2 writes commit to stable storage before reply; no
>> >> +	 * dirty pages accumulate, so throttling is unnecessary.
>> >> +	 * FILE_SYNC and DATA_SYNC writes flush immediately and do
>> >> +	 * not leave uncommitted dirty pages behind.
>> >> +	 * Direct I/O and DONTCACHE bypass the page cache entirely.
>> >> +	 */
>> >> +	if (rqstp->rq_vers > 2 &&
>> >> +	    stable == NFS_UNSTABLE &&
>> >> +	    nfsd_io_cache_write == NFSD_IO_BUFFERED) {
>> >> +		host_err =
>> >> +			balance_dirty_pages_ratelimited_flags(file->f_mapping,
>> >> +							      BDP_ASYNC);
>> >> +		if (host_err == -EAGAIN)
>> >> +			goto out_nfserr;
>> >
>> > I doubt that this will do what you want - at least not reliably.
>> >
>> > balance_dirty_pages_ratelimited_flags() assumes it will be called
>> > repeatedly by the same task and it lets that task write for a while,
>> > then blocks it, then lets it write some more.
>> >
>> > The way you have integrated it into nfsd could result in the write load
>> > bouncing around among different threads and behaving inconsistently.
>> >
>> > Also the delay imposed is (for a Linux client) between 100ms and
>> > 15seconds.
>> > I suspect that is often longer than we would want.  The actual pause
>> > imposed by page-writeback.c is variable based on the measured throughput
>> > of the backing device.
>> 
>> These are UNSTABLE WRITEs. I can understand delaying the COMMIT because
>> that's where NFSD requests synchronous interaction with the backing
>> device. But nothing delays an UNSTABLE WRITE if the backing device is
>> slow.
>
> That isn't correct.  If the "dirty threshold" is reached (e.g.  10% of
> memory dirty) then balance_dirty_pages() will delay all writes to avoid
> exceeding the dirty page limit.

That doesn't seem to be happening in some cases. Or perhaps, it is
happening, but the added delay is not aggressive enough.


>> > But maybe I'm seeing problems that don't exist.  Testing would help, but
>> > finding a mix of loads that properly stress the system would be a
>> > challenge.
>> >
>> > And maybe just allowing the thread pool to grow will make this a
>> > non-problem? 
>> 
>> I think allowing the thread pool to grow could make the memory problem
>> worse.
>
> At 4(?) pages per thread?

I'm talking about the WRITE payloads, not the thread footprint.

More threads means capacity to handle a higher rate of ingress UNSTABLE
WRITE traffic. I think we need a way for NFSD to complete those requests
quickly (with NFS4ERR_DELAY, for example) when the server is under duress
so that WRITE payloads pending on the transport queue or waiting to be
committed do not consume server memory until the server has the resources
to process the WRITEs.

Flow control, essentially.


> What exactly is "the memory problem"?  Do you have specific symptoms you
> are trying to address?  Have you had NFS server run out of memory and
> grind to a halt?

Review the past 9 months of Mike's work on direct I/O, published on
this mailing list. Hammerspace has measured this misbehavior and
experienced server melt-down. Their solution is to avoid using the
page cache entirely.

But even so there still seems to be an effective denial-of-service
vector by overloading NFSD with UNSTABLE WRITE traffic faster than
it can push it to persistence.

Perhaps we need better observability first.


-- 
Chuck Lever

  reply	other threads:[~2026-01-10 23:34 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-09 21:56 [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs Chuck Lever
2026-01-10  5:30 ` NeilBrown
2026-01-10 20:28   ` Chuck Lever
2026-01-10 21:38     ` NeilBrown
2026-01-10 23:33       ` Chuck Lever [this message]
2026-01-12  4:15         ` NeilBrown
2026-01-12 14:38           ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eaac3112-eafc-4bc5-974f-752edf868788@app.fastmail.com \
    --to=cel@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=dai.ngo@oracle.com \
    --cc=hch@lst.de \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=okorniev@redhat.com \
    --cc=snitzer@kernel.org \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox