* [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs @ 2026-01-09 21:56 Chuck Lever 2026-01-10 5:30 ` NeilBrown 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever @ 2026-01-09 21:56 UTC (permalink / raw) To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig Cc: linux-nfs, Chuck Lever From: Chuck Lever <chuck.lever@oracle.com> When memory pressure occurs during buffered writes, the traditional approach is for balance_dirty_pages() to put the writing thread to sleep until dirty pages are flushed. For NFSD, this means server threads block waiting for I/O, reducing overall server throughput. Add asynchronous write throttling for UNSTABLE writes using the BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD checks memory pressure before attempting a buffered write. If the call returns -EAGAIN (indicating memory exhaustion), NFSD returns NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead of blocking. Clients then wait and retry, rather than tying up server memory with a cached uncommitted write payload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> --- fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) Compile tested only. Changes since RFC v1: - Remove the experimental debugfs setting - Enforce throttling only for UNSTABLE WRITEs diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 168d3ccc8155..c4550105234e 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, } } + /* + * Throttle buffered writes under memory pressure. When dirty + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be + * returned rather than blocking the thread. This -EAGAIN + * maps to nfserr_jukebox, signaling the client to back off + * and retry rather than tying up a server thread during + * writeback.
+ * + * NFSv2 writes commit to stable storage before reply; no + * dirty pages accumulate, so throttling is unnecessary. + * FILE_SYNC and DATA_SYNC writes flush immediately and do + * not leave uncommitted dirty pages behind. + * Direct I/O and DONTCACHE bypass the page cache entirely. + */ + if (rqstp->rq_vers > 2 && + stable == NFS_UNSTABLE && + nfsd_io_cache_write == NFSD_IO_BUFFERED) { + host_err = + balance_dirty_pages_ratelimited_flags(file->f_mapping, + BDP_ASYNC); + if (host_err == -EAGAIN) + goto out_nfserr; + } + nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload); since = READ_ONCE(file->f_wb_err); -- 2.52.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
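The decision the patch adds can be sketched as a small userspace model. This is a toy, not kernel code: MODEL_BDP_ASYNC and the bare threshold comparison only stand in for the real BDP_ASYNC flag and the machinery inside balance_dirty_pages_ratelimited_flags():

```c
#include <errno.h>

/*
 * Toy model of the patch's control flow. Under dirty-page pressure
 * a synchronous caller would sleep in writeback; an async caller
 * instead gets -EAGAIN, which nfsd would translate to NFS4ERR_DELAY
 * (NFSv4) or NFSERR_JUKEBOX (NFSv3) for the client to retry.
 */
#define MODEL_BDP_ASYNC 0x1	/* stands in for the real BDP_ASYNC */

static int model_balance_dirty(unsigned long dirty_pages,
			       unsigned long dirty_limit,
			       unsigned int flags)
{
	if (dirty_pages <= dirty_limit)
		return 0;		/* no pressure: write proceeds */
	if (flags & MODEL_BDP_ASYNC)
		return -EAGAIN;		/* defer: client backs off and retries */
	/* a blocking caller would sleep here until writeback catches up */
	return 0;
}
```

The point of the async branch is that the nfsd thread returns to the pool immediately; the wait is absorbed by the client rather than by a sleeping server thread.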
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-09 21:56 [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs Chuck Lever @ 2026-01-10 5:30 ` NeilBrown 2026-01-10 20:28 ` Chuck Lever 0 siblings, 1 reply; 7+ messages in thread From: NeilBrown @ 2026-01-10 5:30 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sat, 10 Jan 2026, Chuck Lever wrote: > From: Chuck Lever <chuck.lever@oracle.com> > > When memory pressure occurs during buffered writes, the traditional > approach is for balance_dirty_pages() to put the writing thread to > sleep until dirty pages are flushed. For NFSD, this means server > threads block waiting for I/O, reducing overall server throughput. > > Add asynchronous write throttling for UNSTABLE writes using the > BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD > checks memory pressure before attempting a buffered write. If the > call returns -EAGAIN (indicating memory exhaustion), NFSD returns > NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead > of blocking. > > Clients then wait and retry, rather than tying up server memory with > a cached uncommitted write payload. > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ > 1 file changed, 24 insertions(+) > > Compile tested only. > > Changes since RFC v1: > - Remove the experimental debugfs setting > - Enforce throttling specifically only for UNSTABLE WRITEs > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 168d3ccc8155..c4550105234e 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > } > } > > + /* > + * Throttle buffered writes under memory pressure. 
When dirty > + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be > + * returned rather than blocking the thread. This -EAGAIN > + * maps to nfserr_jukebox, signaling the client to back off > + * and retry rather than tying up a server thread during > + * writeback. > + * > + * NFSv2 writes commit to stable storage before reply; no > + * dirty pages accumulate, so throttling is unnecessary. > + * FILE_SYNC and DATA_SYNC writes flush immediately and do > + * not leave uncommitted dirty pages behind. > + * Direct I/O and DONTCACHE bypass the page cache entirely. > + */ > + if (rqstp->rq_vers > 2 && > + stable == NFS_UNSTABLE && > + nfsd_io_cache_write == NFSD_IO_BUFFERED) { > + host_err = > + balance_dirty_pages_ratelimited_flags(file->f_mapping, > + BDP_ASYNC); > + if (host_err == -EAGAIN) > + goto out_nfserr; I doubt that this will do what you want - at least not reliably. balance_dirty_pages_ratelimited_flags() assumes it will be called repeatedly by the same task and it lets that task write for a while, then blocks it, then lets it write some more. The way you have integrated it into nfsd could result in the write load bouncing around among different threads and behaving inconsistently. Also the delay imposed is (for a Linux client) between 100ms and 15 seconds. I suspect that is often longer than we would want. The actual pause imposed by page-writeback.c is variable based on the measured throughput of the backing device. What we really want, I think, is to be able to push back on the client by limiting the number of bytes in unacknowledged writes, but I don't think NFS has any mechanism for that. I cannot immediately think of any approach that really shows promise, but I suspect that it will involve a deeper interaction with the writeback code in a way that abstracts out the task state so that nfsd can appear to be one-task-per-client (or similar).
Possibly the best approach for throttling the client is to somehow delay the reply (without tying up a thread) so that it sees a fairly precise latency.... But maybe I'm seeing problems that don't exist. Testing would help, but finding a mix of loads that properly stress the system would be a challenge. And maybe just allowing the thread pool to grow will make this a non-problem? Thanks, NeilBrown > + } > + > nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload); > > since = READ_ONCE(file->f_wb_err); > -- > 2.52.0 > > > ^ permalink raw reply [flat|nested] 7+ messages in thread
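The per-client pushback described above - bounding the bytes in unacknowledged writes - has no protocol support today, but the server-side bookkeeping might look something like this sketch. All names here are invented for illustration; nfsd keeps no such state:

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical per-client budget for uncommitted UNSTABLE WRITE
 * bytes. A WRITE that would exceed the budget is answered with a
 * delay error; COMMIT drains the budget. Purely illustrative of
 * the flow-control idea -- NFS offers no such mechanism.
 */
struct client_budget {
	size_t uncommitted;	/* UNSTABLE bytes not yet committed */
	size_t limit;		/* per-client cap */
};

static int budget_write(struct client_budget *c, size_t len)
{
	if (c->uncommitted + len > c->limit)
		return -EAGAIN;	/* would map to NFS4ERR_DELAY */
	c->uncommitted += len;
	return 0;
}

static void budget_commit(struct client_budget *c)
{
	c->uncommitted = 0;	/* COMMIT persisted everything */
}
```

A scheme like this would throttle each client in proportion to how much uncommitted payload it already has outstanding, rather than relying on per-task state in the writeback code.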
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 5:30 ` NeilBrown @ 2026-01-10 20:28 ` Chuck Lever 2026-01-10 21:38 ` NeilBrown 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever @ 2026-01-10 20:28 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: > On Sat, 10 Jan 2026, Chuck Lever wrote: >> From: Chuck Lever <chuck.lever@oracle.com> >> >> When memory pressure occurs during buffered writes, the traditional >> approach is for balance_dirty_pages() to put the writing thread to >> sleep until dirty pages are flushed. For NFSD, this means server >> threads block waiting for I/O, reducing overall server throughput. >> >> Add asynchronous write throttling for UNSTABLE writes using the >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD >> checks memory pressure before attempting a buffered write. If the >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead >> of blocking. >> >> Clients then wait and retry, rather than tying up server memory with >> a cached uncommitted write payload. >> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> >> --- >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ >> 1 file changed, 24 insertions(+) >> >> Compile tested only. >> >> Changes since RFC v1: >> - Remove the experimental debugfs setting >> - Enforce throttling specifically only for UNSTABLE WRITEs >> >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c >> index 168d3ccc8155..c4550105234e 100644 >> --- a/fs/nfsd/vfs.c >> +++ b/fs/nfsd/vfs.c >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, >> } >> } >> >> + /* >> + * Throttle buffered writes under memory pressure. 
When dirty >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be >> + * returned rather than blocking the thread. This -EAGAIN >> + * maps to nfserr_jukebox, signaling the client to back off >> + * and retry rather than tying up a server thread during >> + * writeback. >> + * >> + * NFSv2 writes commit to stable storage before reply; no >> + * dirty pages accumulate, so throttling is unnecessary. >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do >> + * not leave uncommitted dirty pages behind. >> + * Direct I/O and DONTCACHE bypass the page cache entirely. >> + */ >> + if (rqstp->rq_vers > 2 && >> + stable == NFS_UNSTABLE && >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { >> + host_err = >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, >> + BDP_ASYNC); >> + if (host_err == -EAGAIN) >> + goto out_nfserr; > > I doubt that this will do what you want - at least not reliably. > > balance_dirty_pages_ratelimited_flags() assumes it will be called > repeatedly by the same task and it lets that task write for a while, > then blocks it, then lets it write some more. > > The way you have integrated it into nfsd could result in the write load > bouncing around among different threads and behaving inconsistently. > > Also the delay imposed is (for a Linux client) between 100ms and > 15seconds. > I suspect that is often longer than we would want. The actual pause > imposed by page-writeback.c is variable based on the measured throughput > of the backing device. These are UNSTABLE WRITEs. I can understand delaying the COMMIT because that's where NFSD requests synchronous interaction with the backing device. But nothing delays an UNSTABLE WRITE if the backing device is slow. But I can see there could be significant fairness issues with the bdp approach here. > What we really want, I think, is to be able to push back on the client > by limiting the number of bytes in unacknowledged writes, but I don't > think NFS has any mechanism for that. 
> > I cannot immediately think of any approach that really shows promise, > but I suspect that it will involves a deeper interaction with the > writeback code in a way that abstracts out the task state so that nfsd > can appear to be one-task-per-client (or similar). > > Possibly the best approach for throttling the client is to somehow delay > the reply (without tying up a thread) so that it sees a fairly precise > latency.... Set aside my threading comments for a moment. What I'm trying to prevent is UNSTABLE WRITEs tying up server /memory/. When under memory pressure, NFSD needs to delay UNSTABLE WRITEs until there is adequate memory to cache the WRITE payloads, without having to evict something more critical that could cause the server significant heartburn or livelock. I'm considering another idea where the svc threads in each thread pool are all members of cgroup. That way the amount of memory dedicated to NFSD in one container (let's say) can be constrained, preventing it from overrunning memory resources needed to keep the server system up and otherwise responsive. > But maybe I'm seeing problems that don't exist. Testing would help, but > finding a mix of loads that properly stress the system would be a > challenge. > > And maybe just allowing the thread pool to grow will make this a > non-problem? I think allowing the thread pool to grow could make the memory problem worse. -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 20:28 ` Chuck Lever @ 2026-01-10 21:38 ` NeilBrown 2026-01-10 23:33 ` Chuck Lever 0 siblings, 1 reply; 7+ messages in thread From: NeilBrown @ 2026-01-10 21:38 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sun, 11 Jan 2026, Chuck Lever wrote: > > On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: > > On Sat, 10 Jan 2026, Chuck Lever wrote: > >> From: Chuck Lever <chuck.lever@oracle.com> > >> > >> When memory pressure occurs during buffered writes, the traditional > >> approach is for balance_dirty_pages() to put the writing thread to > >> sleep until dirty pages are flushed. For NFSD, this means server > >> threads block waiting for I/O, reducing overall server throughput. > >> > >> Add asynchronous write throttling for UNSTABLE writes using the > >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD > >> checks memory pressure before attempting a buffered write. If the > >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns > >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead > >> of blocking. > >> > >> Clients then wait and retry, rather than tying up server memory with > >> a cached uncommitted write payload. > >> > >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > >> --- > >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ > >> 1 file changed, 24 insertions(+) > >> > >> Compile tested only. 
> >> > >> Changes since RFC v1: > >> - Remove the experimental debugfs setting > >> - Enforce throttling specifically only for UNSTABLE WRITEs > >> > >> > >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > >> index 168d3ccc8155..c4550105234e 100644 > >> --- a/fs/nfsd/vfs.c > >> +++ b/fs/nfsd/vfs.c > >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > >> } > >> } > >> > >> + /* > >> + * Throttle buffered writes under memory pressure. When dirty > >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be > >> + * returned rather than blocking the thread. This -EAGAIN > >> + * maps to nfserr_jukebox, signaling the client to back off > >> + * and retry rather than tying up a server thread during > >> + * writeback. > >> + * > >> + * NFSv2 writes commit to stable storage before reply; no > >> + * dirty pages accumulate, so throttling is unnecessary. > >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do > >> + * not leave uncommitted dirty pages behind. > >> + * Direct I/O and DONTCACHE bypass the page cache entirely. > >> + */ > >> + if (rqstp->rq_vers > 2 && > >> + stable == NFS_UNSTABLE && > >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { > >> + host_err = > >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, > >> + BDP_ASYNC); > >> + if (host_err == -EAGAIN) > >> + goto out_nfserr; > > > > I doubt that this will do what you want - at least not reliably. > > > > balance_dirty_pages_ratelimited_flags() assumes it will be called > > repeatedly by the same task and it lets that task write for a while, > > then blocks it, then lets it write some more. > > > > The way you have integrated it into nfsd could result in the write load > > bouncing around among different threads and behaving inconsistently. > > > > Also the delay imposed is (for a Linux client) between 100ms and > > 15seconds. > > I suspect that is often longer than we would want. 
The actual pause > > imposed by page-writeback.c is variable based on the measured throughput > > of the backing device. > > These are UNSTABLE WRITEs. I can understand delaying the COMMIT because > that's where NFSD requests synchronous interaction with the backing > device. But nothing delays an UNSTABLE WRITE if the backing device is > slow. That isn't correct. If the "dirty threshold" is reached (e.g. 10% of memory dirty) then balance_dirty_pages() will delay all writes to avoid exceeding the dirty page limit. It attempts to monitor the recent throughput of each backing device, and to divide available memory among them in the same proportion as throughput, then throttle writes to backing devices using more than their share. > > But I can see there could be significant fairness issues with the bdp > approach here. > > > > What we really want, I think, is to be able to push back on the client > > by limiting the number of bytes in unacknowledged writes, but I don't > > think NFS has any mechanism for that. > > > > I cannot immediately think of any approach that really shows promise, > > but I suspect that it will involves a deeper interaction with the > > writeback code in a way that abstracts out the task state so that nfsd > > can appear to be one-task-per-client (or similar). > > > > Possibly the best approach for throttling the client is to somehow delay > > the reply (without tying up a thread) so that it sees a fairly precise > > latency.... > > Set aside my threading comments for a moment. What I'm trying to prevent > is UNSTABLE WRITEs tying up server /memory/. When under memory pressure, > NFSD needs to delay UNSTABLE WRITEs until there is adequate memory to cache > the WRITE payloads, without having to evict something more critical that > could cause the server significant heartburn or livelock. The writeback code already does this.
The numbers in /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_bytes can be used to set how much memory can store dirty pages (including UNSTABLE writes in write-back). Writers are throttled to attempt to mostly keep within that limit. > > I'm considering another idea where the svc threads in each thread pool > are all members of cgroup. That way the amount of memory dedicated to > NFSD in one container (let's say) can be constrained, preventing it > from overrunning memory resources needed to keep the server system up > and otherwise responsive. I have no direct experience with mem-cgroups so there are doubtless subtleties that I miss, but I suspect the primary effect of including nfsd in a mem-cgroup would be to push file content out of the page-cache more quickly. I don't think it can control dirty pages separately from clean pages, but I could easily be wrong. And that might be exactly what you want to happen. > > > > But maybe I'm seeing problems that don't exist. Testing would help, but > > finding a mix of loads that properly stress the system would be a > > challenge. > > > > And maybe just allowing the thread pool to grow will make this a > > non-problem? > > I think allowing the thread pool to grow could make the memory problem > worse. At 4(?) pages per thread? What exactly is "the memory problem"? Do you have specific symptoms you are trying to address? Have you had NFS server run out of memory and grind to a halt? Thanks, NeilBrown > > > -- > Chuck Lever > > ^ permalink raw reply [flat|nested] 7+ messages in thread
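The precedence between those two knobs is simple: vm.dirty_bytes, when nonzero, overrides vm.dirty_ratio. A simplified model of the resulting global limit (this elides the kernel's dirtyable-memory accounting, per-device shares, and the separate background thresholds in mm/page-writeback.c):

```c
/*
 * Simplified model of the global dirty limit: dirty_bytes wins
 * when set; otherwise the limit is dirty_ratio percent of the
 * memory considered dirtyable. The real kernel calculation has
 * further refinements not shown here.
 */
static unsigned long model_dirty_limit(unsigned long dirtyable_bytes,
				       unsigned int dirty_ratio,
				       unsigned long dirty_bytes)
{
	if (dirty_bytes)
		return dirty_bytes;
	return dirtyable_bytes / 100 * dirty_ratio;
}
```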
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 21:38 ` NeilBrown @ 2026-01-10 23:33 ` Chuck Lever 2026-01-12 4:15 ` NeilBrown 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever @ 2026-01-10 23:33 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sat, Jan 10, 2026, at 4:38 PM, NeilBrown wrote: > On Sun, 11 Jan 2026, Chuck Lever wrote: >> >> On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: >> > On Sat, 10 Jan 2026, Chuck Lever wrote: >> >> From: Chuck Lever <chuck.lever@oracle.com> >> >> >> >> When memory pressure occurs during buffered writes, the traditional >> >> approach is for balance_dirty_pages() to put the writing thread to >> >> sleep until dirty pages are flushed. For NFSD, this means server >> >> threads block waiting for I/O, reducing overall server throughput. >> >> >> >> Add asynchronous write throttling for UNSTABLE writes using the >> >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD >> >> checks memory pressure before attempting a buffered write. If the >> >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns >> >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead >> >> of blocking. >> >> >> >> Clients then wait and retry, rather than tying up server memory with >> >> a cached uncommitted write payload. >> >> >> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> >> >> --- >> >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ >> >> 1 file changed, 24 insertions(+) >> >> >> >> Compile tested only. 
>> >> >> >> Changes since RFC v1: >> >> - Remove the experimental debugfs setting >> >> - Enforce throttling specifically only for UNSTABLE WRITEs >> >> >> >> >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c >> >> index 168d3ccc8155..c4550105234e 100644 >> >> --- a/fs/nfsd/vfs.c >> >> +++ b/fs/nfsd/vfs.c >> >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, >> >> } >> >> } >> >> >> >> + /* >> >> + * Throttle buffered writes under memory pressure. When dirty >> >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be >> >> + * returned rather than blocking the thread. This -EAGAIN >> >> + * maps to nfserr_jukebox, signaling the client to back off >> >> + * and retry rather than tying up a server thread during >> >> + * writeback. >> >> + * >> >> + * NFSv2 writes commit to stable storage before reply; no >> >> + * dirty pages accumulate, so throttling is unnecessary. >> >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do >> >> + * not leave uncommitted dirty pages behind. >> >> + * Direct I/O and DONTCACHE bypass the page cache entirely. >> >> + */ >> >> + if (rqstp->rq_vers > 2 && >> >> + stable == NFS_UNSTABLE && >> >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { >> >> + host_err = >> >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, >> >> + BDP_ASYNC); >> >> + if (host_err == -EAGAIN) >> >> + goto out_nfserr; >> > >> > I doubt that this will do what you want - at least not reliably. >> > >> > balance_dirty_pages_ratelimited_flags() assumes it will be called >> > repeatedly by the same task and it lets that task write for a while, >> > then blocks it, then lets it write some more. >> > >> > The way you have integrated it into nfsd could result in the write load >> > bouncing around among different threads and behaving inconsistently. >> > >> > Also the delay imposed is (for a Linux client) between 100ms and >> > 15seconds. >> > I suspect that is often longer than we would want. 
The actual pause >> > imposed by page-writeback.c is variable based on the measured throughput >> > of the backing device. >> >> These are UNSTABLE WRITEs. I can understand delaying the COMMIT because >> that's where NFSD requests synchronous interaction with the backing >> device. But nothing delays an UNSTABLE WRITE if the backing device is >> slow. > > That isn't correct. If the "dirty threshold" is reached (e.g. 10% of > memory dirty) then balance_dirty_pages() will delay all writes to avoid > exceeding the dirty page limit. That doesn't seem to be happening in some cases. Or perhaps, it is happening, but the added delay is not aggressive enough. >> > But maybe I'm seeing problems that don't exist. Testing would help, but >> > finding a mix of loads that properly stress the system would be a >> > challenge. >> > >> > And maybe just allowing the thread pool to grow will make this a >> > non-problem? >> >> I think allowing the thread pool to grow could make the memory problem >> worse. > > At 4(?) pages per thread? I'm talking about the WRITE payloads, not the thread footprint. More threads means capacity to handle a higher rate of ingress UNSTABLE WRITE traffic. I think we need a way for NFSD to complete those requests quickly (with NFS4ERR_DELAY, for example) when the server is under duress so that WRITE payloads pending on the transport queue or waiting to be committed do not consume server memory until the server has the resources to process the WRITEs. Flow control, essentially. > What exactly is "the memory problem"? Do you have specific symptoms you > are trying to address? Have you had NFS server run out of memory and > grind to a halt? Review the past 9 months of Mike's work on direct I/O, published on this mailing list. Hammerspace has measured this misbehavior and experienced server melt-down. Their solution is to avoid using the page cache entirely. 
But even so there still seems to be an effective denial-of-service vector by overloading NFSD with UNSTABLE WRITE traffic faster than it can push it to persistence. Perhaps we need better observability first. -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 23:33 ` Chuck Lever @ 2026-01-12 4:15 ` NeilBrown 2026-01-12 14:38 ` Chuck Lever 0 siblings, 1 reply; 7+ messages in thread From: NeilBrown @ 2026-01-12 4:15 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sun, 11 Jan 2026, Chuck Lever wrote: > > On Sat, Jan 10, 2026, at 4:38 PM, NeilBrown wrote: > > On Sun, 11 Jan 2026, Chuck Lever wrote: > >> > >> On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: > >> > On Sat, 10 Jan 2026, Chuck Lever wrote: > >> >> From: Chuck Lever <chuck.lever@oracle.com> > >> >> > >> >> When memory pressure occurs during buffered writes, the traditional > >> >> approach is for balance_dirty_pages() to put the writing thread to > >> >> sleep until dirty pages are flushed. For NFSD, this means server > >> >> threads block waiting for I/O, reducing overall server throughput. > >> >> > >> >> Add asynchronous write throttling for UNSTABLE writes using the > >> >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD > >> >> checks memory pressure before attempting a buffered write. If the > >> >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns > >> >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead > >> >> of blocking. > >> >> > >> >> Clients then wait and retry, rather than tying up server memory with > >> >> a cached uncommitted write payload. > >> >> > >> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > >> >> --- > >> >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ > >> >> 1 file changed, 24 insertions(+) > >> >> > >> >> Compile tested only. 
> >> >> > >> >> Changes since RFC v1: > >> >> - Remove the experimental debugfs setting > >> >> - Enforce throttling specifically only for UNSTABLE WRITEs > >> >> > >> >> > >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > >> >> index 168d3ccc8155..c4550105234e 100644 > >> >> --- a/fs/nfsd/vfs.c > >> >> +++ b/fs/nfsd/vfs.c > >> >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > >> >> } > >> >> } > >> >> > >> >> + /* > >> >> + * Throttle buffered writes under memory pressure. When dirty > >> >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be > >> >> + * returned rather than blocking the thread. This -EAGAIN > >> >> + * maps to nfserr_jukebox, signaling the client to back off > >> >> + * and retry rather than tying up a server thread during > >> >> + * writeback. > >> >> + * > >> >> + * NFSv2 writes commit to stable storage before reply; no > >> >> + * dirty pages accumulate, so throttling is unnecessary. > >> >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do > >> >> + * not leave uncommitted dirty pages behind. > >> >> + * Direct I/O and DONTCACHE bypass the page cache entirely. > >> >> + */ > >> >> + if (rqstp->rq_vers > 2 && > >> >> + stable == NFS_UNSTABLE && > >> >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { > >> >> + host_err = > >> >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, > >> >> + BDP_ASYNC); > >> >> + if (host_err == -EAGAIN) > >> >> + goto out_nfserr; > >> > > >> > I doubt that this will do what you want - at least not reliably. > >> > > >> > balance_dirty_pages_ratelimited_flags() assumes it will be called > >> > repeatedly by the same task and it lets that task write for a while, > >> > then blocks it, then lets it write some more. > >> > > >> > The way you have integrated it into nfsd could result in the write load > >> > bouncing around among different threads and behaving inconsistently. 
> >> > > >> > Also the delay imposed is (for a Linux client) between 100ms and > >> > 15seconds. > >> > I suspect that is often longer than we would want. The actual pause > >> > imposed by page-writeback.c is variable based on the measured throughput > >> > of the backing device. > >> > >> These are UNSTABLE WRITEs. I can understand delaying the COMMIT because > >> that's where NFSD requests synchronous interaction with the backing > >> device. But nothing delays an UNSTABLE WRITE if the backing device is > >> slow. > > > > That isn't correct. If the "dirty threshold" is reached (e.g. 10% of > > memory dirty) then balance_dirty_pages() will delay all writes to avoid > > exceeding the dirty page limit. > > That doesn't seem to be happening in some cases. Or perhaps, it is > happening, but the added delay is not aggressive enough. I would be surprised if the actual count of dirty pages (grep Dirty /proc/meminfo) grows much above the dirty page limit. There is some elasticity so the threads don't need to check global variables on every page - only every 1024 pages or something like that - so the nominal limit can be exceeded briefly. But there should still be bounds, and if nfsd is being allowed to dirty significantly more pages than it should, then that is a problem well beyond nfsd. > > > >> > But maybe I'm seeing problems that don't exist. Testing would help, but > >> > finding a mix of loads that properly stress the system would be a > >> > challenge. > >> > > >> > And maybe just allowing the thread pool to grow will make this a > >> > non-problem? > >> > >> I think allowing the thread pool to grow could make the memory problem > >> worse. > > > > At 4(?) pages per thread? > > I'm talking about the WRITE payloads, not the thread footprint. hmmmm.. My footprint calculation was extremely wrong. I didn't allow for the rq_pages allocation - so add 1MB or more per thread.
But the write payload cannot get beyond the thread footprint without creating dirty pages, which are limited. > > More threads means capacity to handle a higher rate of ingress UNSTABLE > WRITE traffic. I think we need a way for NFSD to complete those requests > quickly (with NFS4ERR_DELAY, for example) when the server is under duress > so that WRITE payloads pending on the transport queue or waiting to be > committed do not consume server memory until the server has the resources > to process the WRITEs. > > Flow control, essentially. The transport queue already has flow control (for TCP). And the threads allocate a large rq_pages whether they are serving WRITEs or not. So when the server is under memory duress, the threads will block in bdp, which will push back on the transport queue once all threads are blocked... > > > > What exactly is "the memory problem"? Do you have specific symptoms you > > are trying to address? Have you had NFS server run out of memory and > > grind to a halt? > > Review the past 9 months of Mike's work on direct I/O, published on > this mailing list. Hammerspace has measured this misbehavior and > experienced server melt-down. Their solution is to avoid using the > page cache entirely. I didn't pay very close attention but I thought the assessment was that adding lots of single-use pages to the page cache, and then having to clean them out later, caused a lot of unnecessary work that was best avoided, and that drop-behind addressed this. Are you trying to find another way to address the same problem? > > But even so there still seems to be an effective denial-of-service > vector by overloading NFSD with UNSTABLE WRITE traffic faster than > it can push it to persistence. > > Perhaps we need better observability first. It would definitely be helpful to have numbers to categorise the problem. I think there *is* flow control in place. 
A NFS server can certainly consume all the bandwidth to storage, but I don't see how it is able to consume all of memory (providing the number of threads is "appropriate" for the total amount of memory). Thanks, NeilBrown ^ permalink raw reply [flat|nested] 7+ messages in thread
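The "elasticity" mentioned in this exchange - writers consulting the global dirty counter only every so many pages - bounds the overshoot rather than eliminating it. A toy model (numbers illustrative; the real per-task ratelimit in balance_dirty_pages_ratelimited() adapts dynamically):

```c
/*
 * Toy model of ratelimited dirty accounting: a writer checks the
 * global counter only every `ratelimit` pages, so the nominal
 * limit can be exceeded briefly -- by at most ratelimit - 1 pages
 * per writer in this model.
 */
static unsigned long model_dirty;		/* global dirty page count */
static const unsigned long model_limit = 10;	/* nominal dirty limit */

struct model_writer {
	unsigned int since_check;	/* pages since last global check */
};

/* Returns 1 if a page was dirtied, 0 if the writer was throttled. */
static int model_dirty_page(struct model_writer *w, unsigned int ratelimit)
{
	if (++w->since_check >= ratelimit) {
		w->since_check = 0;
		if (model_dirty >= model_limit)
			return 0;	/* throttle at the check point */
	}
	model_dirty++;
	return 1;
}

/* Dirty pages until throttled; returns the final global count. */
static unsigned long model_run(struct model_writer *w, unsigned int ratelimit)
{
	while (model_dirty_page(w, ratelimit))
		;
	return model_dirty;
}
```

With a ratelimit of 4 and a limit of 10, a single writer ends up with 11 dirty pages: the limit is overshot, but only by the pages dirtied between checks.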
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-12 4:15 ` NeilBrown @ 2026-01-12 14:38 ` Chuck Lever 0 siblings, 0 replies; 7+ messages in thread From: Chuck Lever @ 2026-01-12 14:38 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On 1/11/26 11:15 PM, NeilBrown wrote: >>> What exactly is "the memory problem"? Do you have specific symptoms you >>> are trying to address? Have you had NFS server run out of memory and >>> grind to a halt? >> Review the past 9 months of Mike's work on direct I/O, published on >> this mailing list. Hammerspace has measured this misbehavior and >> experienced server melt-down. Their solution is to avoid using the >> page cache entirely. > I didn't pay very close attention but I thought the assessment was that > adding lots of single-use pages to the page cache, and then having to > clean them out later, caused a lot of unnecessary work that was best > avoided, and that drop-behind addressed this. Drop-behind is too inefficient to be used here. This is why direct I/O is also an option. Direct I/O is measurably superior to drop-behind. > Are you trying to find another way to address the same problem? Yes. I don't think we can backport direct I/O, for example, to LTS kernels. I expect it will be important to have an alternative to the new I/O caching modes for this reason alone. -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread