* [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs @ 2026-01-09 21:56 Chuck Lever 2026-01-10 5:30 ` NeilBrown 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever @ 2026-01-09 21:56 UTC (permalink / raw) To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig Cc: linux-nfs, Chuck Lever From: Chuck Lever <chuck.lever@oracle.com> When memory pressure occurs during buffered writes, the traditional approach is for balance_dirty_pages() to put the writing thread to sleep until dirty pages are flushed. For NFSD, this means server threads block waiting for I/O, reducing overall server throughput. Add asynchronous write throttling for UNSTABLE writes using the BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD checks memory pressure before attempting a buffered write. If the call returns -EAGAIN (indicating memory exhaustion), NFSD returns NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead of blocking. Clients then wait and retry, rather than tying up server memory with a cached uncommitted write payload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> --- fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) Compile tested only. Changes since RFC v1: - Remove the experimental debugfs setting - Enforce throttling only for UNSTABLE WRITEs diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 168d3ccc8155..c4550105234e 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, } } + /* + * Throttle buffered writes under memory pressure. When dirty + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be + * returned rather than blocking the thread. This -EAGAIN + * maps to nfserr_jukebox, signaling the client to back off + * and retry rather than tying up a server thread during + * writeback.
+ * + * NFSv2 writes commit to stable storage before reply; no + * dirty pages accumulate, so throttling is unnecessary. + * FILE_SYNC and DATA_SYNC writes flush immediately and do + * not leave uncommitted dirty pages behind. + * Direct I/O and DONTCACHE bypass the page cache entirely. + */ + if (rqstp->rq_vers > 2 && + stable == NFS_UNSTABLE && + nfsd_io_cache_write == NFSD_IO_BUFFERED) { + host_err = + balance_dirty_pages_ratelimited_flags(file->f_mapping, + BDP_ASYNC); + if (host_err == -EAGAIN) + goto out_nfserr; + } + nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload); since = READ_ONCE(file->f_wb_err); -- 2.52.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
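The decision the patch adds can be sketched as a small userspace model. This is a toy, not kernel code: MODEL_BDP_ASYNC and the bare threshold comparison only stand in for the real BDP_ASYNC flag and the machinery inside balance_dirty_pages_ratelimited_flags():

```c
#include <errno.h>

/*
 * Toy model of the patch's control flow. Under dirty-page pressure
 * a synchronous caller would sleep in writeback; an async caller
 * instead gets -EAGAIN, which nfsd would translate to NFS4ERR_DELAY
 * (NFSv4) or NFSERR_JUKEBOX (NFSv3) for the client to retry.
 */
#define MODEL_BDP_ASYNC 0x1	/* stands in for the real BDP_ASYNC */

static int model_balance_dirty(unsigned long dirty_pages,
			       unsigned long dirty_limit,
			       unsigned int flags)
{
	if (dirty_pages <= dirty_limit)
		return 0;		/* no pressure: write proceeds */
	if (flags & MODEL_BDP_ASYNC)
		return -EAGAIN;		/* defer: client backs off and retries */
	/* a blocking caller would sleep here until writeback catches up */
	return 0;
}
```

The point of the async branch is that the nfsd thread returns to the pool immediately; the wait is absorbed by the client rather than by a sleeping server thread.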
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-09 21:56 [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs Chuck Lever @ 2026-01-10 5:30 ` NeilBrown 2026-01-10 20:28 ` Chuck Lever 0 siblings, 1 reply; 7+ messages in thread From: NeilBrown @ 2026-01-10 5:30 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sat, 10 Jan 2026, Chuck Lever wrote: > From: Chuck Lever <chuck.lever@oracle.com> > > When memory pressure occurs during buffered writes, the traditional > approach is for balance_dirty_pages() to put the writing thread to > sleep until dirty pages are flushed. For NFSD, this means server > threads block waiting for I/O, reducing overall server throughput. > > Add asynchronous write throttling for UNSTABLE writes using the > BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD > checks memory pressure before attempting a buffered write. If the > call returns -EAGAIN (indicating memory exhaustion), NFSD returns > NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead > of blocking. > > Clients then wait and retry, rather than tying up server memory with > a cached uncommitted write payload. > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ > 1 file changed, 24 insertions(+) > > Compile tested only. > > Changes since RFC v1: > - Remove the experimental debugfs setting > - Enforce throttling specifically only for UNSTABLE WRITEs > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 168d3ccc8155..c4550105234e 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > } > } > > + /* > + * Throttle buffered writes under memory pressure. 
When dirty > + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be > + * returned rather than blocking the thread. This -EAGAIN > + * maps to nfserr_jukebox, signaling the client to back off > + * and retry rather than tying up a server thread during > + * writeback. > + * > + * NFSv2 writes commit to stable storage before reply; no > + * dirty pages accumulate, so throttling is unnecessary. > + * FILE_SYNC and DATA_SYNC writes flush immediately and do > + * not leave uncommitted dirty pages behind. > + * Direct I/O and DONTCACHE bypass the page cache entirely. > + */ > + if (rqstp->rq_vers > 2 && > + stable == NFS_UNSTABLE && > + nfsd_io_cache_write == NFSD_IO_BUFFERED) { > + host_err = > + balance_dirty_pages_ratelimited_flags(file->f_mapping, > + BDP_ASYNC); > + if (host_err == -EAGAIN) > + goto out_nfserr; I doubt that this will do what you want - at least not reliably. balance_dirty_pages_ratelimited_flags() assumes it will be called repeatedly by the same task and it lets that task write for a while, then blocks it, then lets it write some more. The way you have integrated it into nfsd could result in the write load bouncing around among different threads and behaving inconsistently. Also the delay imposed is (for a Linux client) between 100ms and 15 seconds. I suspect that is often longer than we would want. The actual pause imposed by page-writeback.c is variable based on the measured throughput of the backing device. What we really want, I think, is to be able to push back on the client by limiting the number of bytes in unacknowledged writes, but I don't think NFS has any mechanism for that. I cannot immediately think of any approach that really shows promise, but I suspect that it will involve a deeper interaction with the writeback code in a way that abstracts out the task state so that nfsd can appear to be one-task-per-client (or similar).
Possibly the best approach for throttling the client is to somehow delay the reply (without tying up a thread) so that it sees a fairly precise latency.... But maybe I'm seeing problems that don't exist. Testing would help, but finding a mix of loads that properly stress the system would be a challenge. And maybe just allowing the thread pool to grow will make this a non-problem? Thanks, NeilBrown > + } > + > nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload); > > since = READ_ONCE(file->f_wb_err); > -- > 2.52.0 > > > ^ permalink raw reply [flat|nested] 7+ messages in thread
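The per-client pushback described above - bounding the bytes in unacknowledged writes - has no protocol support today, but the server-side bookkeeping might look something like this sketch. All names here are invented for illustration; nfsd keeps no such state:

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical per-client budget for uncommitted UNSTABLE WRITE
 * bytes. A WRITE that would exceed the budget is answered with a
 * delay error; COMMIT drains the budget. Purely illustrative of
 * the flow-control idea -- NFS offers no such mechanism.
 */
struct client_budget {
	size_t uncommitted;	/* UNSTABLE bytes not yet committed */
	size_t limit;		/* per-client cap */
};

static int budget_write(struct client_budget *c, size_t len)
{
	if (c->uncommitted + len > c->limit)
		return -EAGAIN;	/* would map to NFS4ERR_DELAY */
	c->uncommitted += len;
	return 0;
}

static void budget_commit(struct client_budget *c)
{
	c->uncommitted = 0;	/* COMMIT persisted everything */
}
```

A scheme like this would throttle each client in proportion to how much uncommitted payload it already has outstanding, rather than relying on per-task state in the writeback code.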
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 5:30 ` NeilBrown @ 2026-01-10 20:28 ` Chuck Lever 2026-01-10 21:38 ` NeilBrown 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever @ 2026-01-10 20:28 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: > On Sat, 10 Jan 2026, Chuck Lever wrote: >> From: Chuck Lever <chuck.lever@oracle.com> >> >> When memory pressure occurs during buffered writes, the traditional >> approach is for balance_dirty_pages() to put the writing thread to >> sleep until dirty pages are flushed. For NFSD, this means server >> threads block waiting for I/O, reducing overall server throughput. >> >> Add asynchronous write throttling for UNSTABLE writes using the >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD >> checks memory pressure before attempting a buffered write. If the >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead >> of blocking. >> >> Clients then wait and retry, rather than tying up server memory with >> a cached uncommitted write payload. >> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> >> --- >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ >> 1 file changed, 24 insertions(+) >> >> Compile tested only. >> >> Changes since RFC v1: >> - Remove the experimental debugfs setting >> - Enforce throttling specifically only for UNSTABLE WRITEs >> >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c >> index 168d3ccc8155..c4550105234e 100644 >> --- a/fs/nfsd/vfs.c >> +++ b/fs/nfsd/vfs.c >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, >> } >> } >> >> + /* >> + * Throttle buffered writes under memory pressure. 
When dirty >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be >> + * returned rather than blocking the thread. This -EAGAIN >> + * maps to nfserr_jukebox, signaling the client to back off >> + * and retry rather than tying up a server thread during >> + * writeback. >> + * >> + * NFSv2 writes commit to stable storage before reply; no >> + * dirty pages accumulate, so throttling is unnecessary. >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do >> + * not leave uncommitted dirty pages behind. >> + * Direct I/O and DONTCACHE bypass the page cache entirely. >> + */ >> + if (rqstp->rq_vers > 2 && >> + stable == NFS_UNSTABLE && >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { >> + host_err = >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, >> + BDP_ASYNC); >> + if (host_err == -EAGAIN) >> + goto out_nfserr; > > I doubt that this will do what you want - at least not reliably. > > balance_dirty_pages_ratelimited_flags() assumes it will be called > repeatedly by the same task and it lets that task write for a while, > then blocks it, then lets it write some more. > > The way you have integrated it into nfsd could result in the write load > bouncing around among different threads and behaving inconsistently. > > Also the delay imposed is (for a Linux client) between 100ms and > 15seconds. > I suspect that is often longer than we would want. The actual pause > imposed by page-writeback.c is variable based on the measured throughput > of the backing device. These are UNSTABLE WRITEs. I can understand delaying the COMMIT because that's where NFSD requests synchronous interaction with the backing device. But nothing delays an UNSTABLE WRITE if the backing device is slow. But I can see there could be significant fairness issues with the bdp approach here. > What we really want, I think, is to be able to push back on the client > by limiting the number of bytes in unacknowledged writes, but I don't > think NFS has any mechanism for that. 
> > I cannot immediately think of any approach that really shows promise, > but I suspect that it will involves a deeper interaction with the > writeback code in a way that abstracts out the task state so that nfsd > can appear to be one-task-per-client (or similar). > > Possibly the best approach for throttling the client is to somehow delay > the reply (without tying up a thread) so that it sees a fairly precise > latency.... Set aside my threading comments for a moment. What I'm trying to prevent is UNSTABLE WRITEs tying up server /memory/. When under memory pressure, NFSD needs to delay UNSTABLE WRITEs until there is adequate memory to cache the WRITE payloads, without having to evict something more critical that could cause the server significant heartburn or livelock. I'm considering another idea where the svc threads in each thread pool are all members of cgroup. That way the amount of memory dedicated to NFSD in one container (let's say) can be constrained, preventing it from overrunning memory resources needed to keep the server system up and otherwise responsive. > But maybe I'm seeing problems that don't exist. Testing would help, but > finding a mix of loads that properly stress the system would be a > challenge. > > And maybe just allowing the thread pool to grow will make this a > non-problem? I think allowing the thread pool to grow could make the memory problem worse. -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 20:28 ` Chuck Lever @ 2026-01-10 21:38 ` NeilBrown 2026-01-10 23:33 ` Chuck Lever 0 siblings, 1 reply; 7+ messages in thread From: NeilBrown @ 2026-01-10 21:38 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sun, 11 Jan 2026, Chuck Lever wrote: > > On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: > > On Sat, 10 Jan 2026, Chuck Lever wrote: > >> From: Chuck Lever <chuck.lever@oracle.com> > >> > >> When memory pressure occurs during buffered writes, the traditional > >> approach is for balance_dirty_pages() to put the writing thread to > >> sleep until dirty pages are flushed. For NFSD, this means server > >> threads block waiting for I/O, reducing overall server throughput. > >> > >> Add asynchronous write throttling for UNSTABLE writes using the > >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD > >> checks memory pressure before attempting a buffered write. If the > >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns > >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead > >> of blocking. > >> > >> Clients then wait and retry, rather than tying up server memory with > >> a cached uncommitted write payload. > >> > >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > >> --- > >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ > >> 1 file changed, 24 insertions(+) > >> > >> Compile tested only. 
> >> > >> Changes since RFC v1: > >> - Remove the experimental debugfs setting > >> - Enforce throttling specifically only for UNSTABLE WRITEs > >> > >> > >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > >> index 168d3ccc8155..c4550105234e 100644 > >> --- a/fs/nfsd/vfs.c > >> +++ b/fs/nfsd/vfs.c > >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > >> } > >> } > >> > >> + /* > >> + * Throttle buffered writes under memory pressure. When dirty > >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be > >> + * returned rather than blocking the thread. This -EAGAIN > >> + * maps to nfserr_jukebox, signaling the client to back off > >> + * and retry rather than tying up a server thread during > >> + * writeback. > >> + * > >> + * NFSv2 writes commit to stable storage before reply; no > >> + * dirty pages accumulate, so throttling is unnecessary. > >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do > >> + * not leave uncommitted dirty pages behind. > >> + * Direct I/O and DONTCACHE bypass the page cache entirely. > >> + */ > >> + if (rqstp->rq_vers > 2 && > >> + stable == NFS_UNSTABLE && > >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { > >> + host_err = > >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, > >> + BDP_ASYNC); > >> + if (host_err == -EAGAIN) > >> + goto out_nfserr; > > > > I doubt that this will do what you want - at least not reliably. > > > > balance_dirty_pages_ratelimited_flags() assumes it will be called > > repeatedly by the same task and it lets that task write for a while, > > then blocks it, then lets it write some more. > > > > The way you have integrated it into nfsd could result in the write load > > bouncing around among different threads and behaving inconsistently. > > > > Also the delay imposed is (for a Linux client) between 100ms and > > 15seconds. > > I suspect that is often longer than we would want. 
The actual pause > > imposed by page-writeback.c is variable based on the measured throughput > > of the backing device. > > These are UNSTABLE WRITEs. I can understand delaying the COMMIT because > that's where NFSD requests synchronous interaction with the backing > device. But nothing delays an UNSTABLE WRITE if the backing device is > slow. That isn't correct. If the "dirty threshold" is reached (e.g. 10% of memory dirty) then balance_dirty_pages() will delay all writes to avoid exceeding the dirty page limit. It attempts to monitor the recent throughput of each backing device, and to divide available memory among them in the same proportion as throughput, then throttle writes to backing devices using more than their share. > > But I can see there could be significant fairness issues with the bdp > approach here. > > > > What we really want, I think, is to be able to push back on the client > > by limiting the number of bytes in unacknowledged writes, but I don't > > think NFS has any mechanism for that. > > > > I cannot immediately think of any approach that really shows promise, > > but I suspect that it will involves a deeper interaction with the > > writeback code in a way that abstracts out the task state so that nfsd > > can appear to be one-task-per-client (or similar). > > > > Possibly the best approach for throttling the client is to somehow delay > > the reply (without tying up a thread) so that it sees a fairly precise > > latency.... > > Set aside my threading comments for a moment. What I'm trying to prevent > is UNSTABLE WRITEs tying up server /memory/. When under memory pressure, > NFSD needs to delay UNSTABLE WRITEs until there is adequate memory to cache > the WRITE payloads, without having to evict something more critical that > could cause the server significant heartburn or livelock. The writeback code already does this.
The numbers in /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_bytes can be used to set how much memory can store dirty pages (including UNSTABLE writes in write-back). Writers are throttled to attempt to mostly keep within that limit. > > I'm considering another idea where the svc threads in each thread pool > are all members of cgroup. That way the amount of memory dedicated to > NFSD in one container (let's say) can be constrained, preventing it > from overrunning memory resources needed to keep the server system up > and otherwise responsive. I have no direct experience with mem-cgroups so there are doubtless subtleties that I miss, but I suspect the primary effect of including nfsd in a mem-cgroup would be to push file content out of the page-cache more quickly. I don't think it can control dirty pages separately from clean pages, but I could easily be wrong. And that might be exactly what you want to happen. > > > > But maybe I'm seeing problems that don't exist. Testing would help, but > > finding a mix of loads that properly stress the system would be a > > challenge. > > > > And maybe just allowing the thread pool to grow will make this a > > non-problem? > > I think allowing the thread pool to grow could make the memory problem > worse. At 4(?) pages per thread? What exactly is "the memory problem"? Do you have specific symptoms you are trying to address? Have you had NFS server run out of memory and grind to a halt? Thanks, NeilBrown > > > -- > Chuck Lever > > ^ permalink raw reply [flat|nested] 7+ messages in thread
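The precedence between those two knobs is simple: vm.dirty_bytes, when nonzero, overrides vm.dirty_ratio. A simplified model of the resulting global limit (this elides the kernel's dirtyable-memory accounting, per-device shares, and the separate background thresholds in mm/page-writeback.c):

```c
/*
 * Simplified model of the global dirty limit: dirty_bytes wins
 * when set; otherwise the limit is dirty_ratio percent of the
 * memory considered dirtyable. The real kernel calculation has
 * further refinements not shown here.
 */
static unsigned long model_dirty_limit(unsigned long dirtyable_bytes,
				       unsigned int dirty_ratio,
				       unsigned long dirty_bytes)
{
	if (dirty_bytes)
		return dirty_bytes;
	return dirtyable_bytes / 100 * dirty_ratio;
}
```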
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 21:38 ` NeilBrown @ 2026-01-10 23:33 ` Chuck Lever 2026-01-12 4:15 ` NeilBrown 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever @ 2026-01-10 23:33 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sat, Jan 10, 2026, at 4:38 PM, NeilBrown wrote: > On Sun, 11 Jan 2026, Chuck Lever wrote: >> >> On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: >> > On Sat, 10 Jan 2026, Chuck Lever wrote: >> >> From: Chuck Lever <chuck.lever@oracle.com> >> >> >> >> When memory pressure occurs during buffered writes, the traditional >> >> approach is for balance_dirty_pages() to put the writing thread to >> >> sleep until dirty pages are flushed. For NFSD, this means server >> >> threads block waiting for I/O, reducing overall server throughput. >> >> >> >> Add asynchronous write throttling for UNSTABLE writes using the >> >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD >> >> checks memory pressure before attempting a buffered write. If the >> >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns >> >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead >> >> of blocking. >> >> >> >> Clients then wait and retry, rather than tying up server memory with >> >> a cached uncommitted write payload. >> >> >> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> >> >> --- >> >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ >> >> 1 file changed, 24 insertions(+) >> >> >> >> Compile tested only. 
>> >> >> >> Changes since RFC v1: >> >> - Remove the experimental debugfs setting >> >> - Enforce throttling specifically only for UNSTABLE WRITEs >> >> >> >> >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c >> >> index 168d3ccc8155..c4550105234e 100644 >> >> --- a/fs/nfsd/vfs.c >> >> +++ b/fs/nfsd/vfs.c >> >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, >> >> } >> >> } >> >> >> >> + /* >> >> + * Throttle buffered writes under memory pressure. When dirty >> >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be >> >> + * returned rather than blocking the thread. This -EAGAIN >> >> + * maps to nfserr_jukebox, signaling the client to back off >> >> + * and retry rather than tying up a server thread during >> >> + * writeback. >> >> + * >> >> + * NFSv2 writes commit to stable storage before reply; no >> >> + * dirty pages accumulate, so throttling is unnecessary. >> >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do >> >> + * not leave uncommitted dirty pages behind. >> >> + * Direct I/O and DONTCACHE bypass the page cache entirely. >> >> + */ >> >> + if (rqstp->rq_vers > 2 && >> >> + stable == NFS_UNSTABLE && >> >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { >> >> + host_err = >> >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, >> >> + BDP_ASYNC); >> >> + if (host_err == -EAGAIN) >> >> + goto out_nfserr; >> > >> > I doubt that this will do what you want - at least not reliably. >> > >> > balance_dirty_pages_ratelimited_flags() assumes it will be called >> > repeatedly by the same task and it lets that task write for a while, >> > then blocks it, then lets it write some more. >> > >> > The way you have integrated it into nfsd could result in the write load >> > bouncing around among different threads and behaving inconsistently. >> > >> > Also the delay imposed is (for a Linux client) between 100ms and >> > 15seconds. >> > I suspect that is often longer than we would want. 
The actual pause >> > imposed by page-writeback.c is variable based on the measured throughput >> > of the backing device. >> >> These are UNSTABLE WRITEs. I can understand delaying the COMMIT because >> that's where NFSD requests synchronous interaction with the backing >> device. But nothing delays an UNSTABLE WRITE if the backing device is >> slow. > > That isn't correct. If the "dirty threshold" is reached (e.g. 10% of > memory dirty) then balance_dirty_pages() will delay all writes to avoid > exceeding the dirty page limit. That doesn't seem to be happening in some cases. Or perhaps, it is happening, but the added delay is not aggressive enough. >> > But maybe I'm seeing problems that don't exist. Testing would help, but >> > finding a mix of loads that properly stress the system would be a >> > challenge. >> > >> > And maybe just allowing the thread pool to grow will make this a >> > non-problem? >> >> I think allowing the thread pool to grow could make the memory problem >> worse. > > At 4(?) pages per thread? I'm talking about the WRITE payloads, not the thread footprint. More threads means capacity to handle a higher rate of ingress UNSTABLE WRITE traffic. I think we need a way for NFSD to complete those requests quickly (with NFS4ERR_DELAY, for example) when the server is under duress so that WRITE payloads pending on the transport queue or waiting to be committed do not consume server memory until the server has the resources to process the WRITEs. Flow control, essentially. > What exactly is "the memory problem"? Do you have specific symptoms you > are trying to address? Have you had NFS server run out of memory and > grind to a halt? Review the past 9 months of Mike's work on direct I/O, published on this mailing list. Hammerspace has measured this misbehavior and experienced server melt-down. Their solution is to avoid using the page cache entirely. 
But even so there still seems to be an effective denial-of-service vector by overloading NFSD with UNSTABLE WRITE traffic faster than it can push it to persistence. Perhaps we need better observability first. -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-10 23:33 ` Chuck Lever @ 2026-01-12 4:15 ` NeilBrown 2026-01-12 14:38 ` Chuck Lever 0 siblings, 1 reply; 7+ messages in thread From: NeilBrown @ 2026-01-12 4:15 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On Sun, 11 Jan 2026, Chuck Lever wrote: > > On Sat, Jan 10, 2026, at 4:38 PM, NeilBrown wrote: > > On Sun, 11 Jan 2026, Chuck Lever wrote: > >> > >> On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote: > >> > On Sat, 10 Jan 2026, Chuck Lever wrote: > >> >> From: Chuck Lever <chuck.lever@oracle.com> > >> >> > >> >> When memory pressure occurs during buffered writes, the traditional > >> >> approach is for balance_dirty_pages() to put the writing thread to > >> >> sleep until dirty pages are flushed. For NFSD, this means server > >> >> threads block waiting for I/O, reducing overall server throughput. > >> >> > >> >> Add asynchronous write throttling for UNSTABLE writes using the > >> >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD > >> >> checks memory pressure before attempting a buffered write. If the > >> >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns > >> >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead > >> >> of blocking. > >> >> > >> >> Clients then wait and retry, rather than tying up server memory with > >> >> a cached uncommitted write payload. > >> >> > >> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > >> >> --- > >> >> fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++ > >> >> 1 file changed, 24 insertions(+) > >> >> > >> >> Compile tested only. 
> >> >> > >> >> Changes since RFC v1: > >> >> - Remove the experimental debugfs setting > >> >> - Enforce throttling specifically only for UNSTABLE WRITEs > >> >> > >> >> > >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > >> >> index 168d3ccc8155..c4550105234e 100644 > >> >> --- a/fs/nfsd/vfs.c > >> >> +++ b/fs/nfsd/vfs.c > >> >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > >> >> } > >> >> } > >> >> > >> >> + /* > >> >> + * Throttle buffered writes under memory pressure. When dirty > >> >> + * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be > >> >> + * returned rather than blocking the thread. This -EAGAIN > >> >> + * maps to nfserr_jukebox, signaling the client to back off > >> >> + * and retry rather than tying up a server thread during > >> >> + * writeback. > >> >> + * > >> >> + * NFSv2 writes commit to stable storage before reply; no > >> >> + * dirty pages accumulate, so throttling is unnecessary. > >> >> + * FILE_SYNC and DATA_SYNC writes flush immediately and do > >> >> + * not leave uncommitted dirty pages behind. > >> >> + * Direct I/O and DONTCACHE bypass the page cache entirely. > >> >> + */ > >> >> + if (rqstp->rq_vers > 2 && > >> >> + stable == NFS_UNSTABLE && > >> >> + nfsd_io_cache_write == NFSD_IO_BUFFERED) { > >> >> + host_err = > >> >> + balance_dirty_pages_ratelimited_flags(file->f_mapping, > >> >> + BDP_ASYNC); > >> >> + if (host_err == -EAGAIN) > >> >> + goto out_nfserr; > >> > > >> > I doubt that this will do what you want - at least not reliably. > >> > > >> > balance_dirty_pages_ratelimited_flags() assumes it will be called > >> > repeatedly by the same task and it lets that task write for a while, > >> > then blocks it, then lets it write some more. > >> > > >> > The way you have integrated it into nfsd could result in the write load > >> > bouncing around among different threads and behaving inconsistently. 
> >> > > >> > Also the delay imposed is (for a Linux client) between 100ms and > >> > 15seconds. > >> > I suspect that is often longer than we would want. The actual pause > >> > imposed by page-writeback.c is variable based on the measured throughput > >> > of the backing device. > >> > >> These are UNSTABLE WRITEs. I can understand delaying the COMMIT because > >> that's where NFSD requests synchronous interaction with the backing > >> device. But nothing delays an UNSTABLE WRITE if the backing device is > >> slow. > > > > That isn't correct. If the "dirty threshold" is reached (e.g. 10% of > > memory dirty) then balance_dirty_pages() will delay all writes to avoid > > exceeding the dirty page limit. > > That doesn't seem to be happening in some cases. Or perhaps, it is > happening, but the added delay is not aggressive enough. I would be surprised if the actual count of dirty pages (grep Dirty /proc/meminfo) grows much above the dirty page limit. There is some elasticity so the threads don't need to check global variables on every page - only every 1024 pages or something like that - so the nominal limit can be exceeded briefly. But there should still be bounds, and if nfsd is being allowed to dirty significantly more pages than it should, then that is a problem well beyond nfsd. > > > >> > But maybe I'm seeing problems that don't exist. Testing would help, but > >> > finding a mix of loads that properly stress the system would be a > >> > challenge. > >> > > >> > And maybe just allowing the thread pool to grow will make this a > >> > non-problem? > >> > >> I think allowing the thread pool to grow could make the memory problem > >> worse. > > > > At 4(?) pages per thread? > > I'm talking about the WRITE payloads, not the thread footprint. hmmmm.. My footprint calculation was extremely wrong. I didn't allow for the rq_pages allocation - so add 1MB or more per thread.
But the write payload cannot get beyond the thread footprint without creating dirty pages, which are limited. > > More threads means capacity to handle a higher rate of ingress UNSTABLE > WRITE traffic. I think we need a way for NFSD to complete those requests > quickly (with NFS4ERR_DELAY, for example) when the server is under duress > so that WRITE payloads pending on the transport queue or waiting to be > committed do not consume server memory until the server has the resources > to process the WRITEs. > > Flow control, essentially. The transport queue already has flow control (for TCP). And the threads allocate a large rq_pages whether they are serving WRITEs or not. So when the server is under memory duress, the threads will block in bdp, which will push back on the transport queue once all threads are blocked... > > > > What exactly is "the memory problem"? Do you have specific symptoms you > > are trying to address? Have you had NFS server run out of memory and > > grind to a halt? > > Review the past 9 months of Mike's work on direct I/O, published on > this mailing list. Hammerspace has measured this misbehavior and > experienced server melt-down. Their solution is to avoid using the > page cache entirely. I didn't pay very close attention but I thought the assessment was that adding lots of single-use pages to the page cache, and then having to clean them out later, caused a lot of unnecessary work that was best avoided, and that drop-behind addressed this. Are you trying to find another way to address the same problem? > > But even so there still seems to be an effective denial-of-service > vector by overloading NFSD with UNSTABLE WRITE traffic faster than > it can push it to persistence. > > Perhaps we need better observability first. It would definitely be helpful to have numbers to categorise the problem. I think there *is* flow control in place. 
A NFS server can certainly consume all the bandwidth to storage, but I don't see how it is able to consume all of memory (providing the number of threads is "appropriate" for the total amount of memory). Thanks, NeilBrown ^ permalink raw reply [flat|nested] 7+ messages in thread
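The "elasticity" mentioned in this exchange - writers consulting the global dirty counter only every so many pages - bounds the overshoot rather than eliminating it. A toy model (numbers illustrative; the real per-task ratelimit in balance_dirty_pages_ratelimited() adapts dynamically):

```c
/*
 * Toy model of ratelimited dirty accounting: a writer checks the
 * global counter only every `ratelimit` pages, so the nominal
 * limit can be exceeded briefly -- by at most ratelimit - 1 pages
 * per writer in this model.
 */
static unsigned long model_dirty;		/* global dirty page count */
static const unsigned long model_limit = 10;	/* nominal dirty limit */

struct model_writer {
	unsigned int since_check;	/* pages since last global check */
};

/* Returns 1 if a page was dirtied, 0 if the writer was throttled. */
static int model_dirty_page(struct model_writer *w, unsigned int ratelimit)
{
	if (++w->since_check >= ratelimit) {
		w->since_check = 0;
		if (model_dirty >= model_limit)
			return 0;	/* throttle at the check point */
	}
	model_dirty++;
	return 1;
}

/* Dirty pages until throttled; returns the final global count. */
static unsigned long model_run(struct model_writer *w, unsigned int ratelimit)
{
	while (model_dirty_page(w, ratelimit))
		;
	return model_dirty;
}
```

With a ratelimit of 4 and a limit of 10, a single writer ends up with 11 dirty pages: the limit is overshot, but only by the pages dirtied between checks.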
* Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support for UNSTABLE WRITEs 2026-01-12 4:15 ` NeilBrown @ 2026-01-12 14:38 ` Chuck Lever 0 siblings, 0 replies; 7+ messages in thread From: Chuck Lever @ 2026-01-12 14:38 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, Christoph Hellwig, linux-nfs, Chuck Lever On 1/11/26 11:15 PM, NeilBrown wrote: >>> What exactly is "the memory problem"? Do you have specific symptoms you >>> are trying to address? Have you had NFS server run out of memory and >>> grind to a halt? >> Review the past 9 months of Mike's work on direct I/O, published on >> this mailing list. Hammerspace has measured this misbehavior and >> experienced server melt-down. Their solution is to avoid using the >> page cache entirely. > I didn't pay very close attention but I thought the assessment was that > adding lots of single-use pages to the page cache, and then having to > clean them out later, caused a lot of unnecessary work that was best > avoided, and that drop-behind addressed this. Drop-behind is too inefficient to be used here. This is why direct I/O is also an option. Direct I/O is measurably superior to drop-behind. > Are you trying to find another way to address the same problem? Yes. I don't think we can backport direct I/O, for example, to LTS kernels. I expect it will be important to have an alternative to the new I/O caching modes for this reason alone. -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread