Date: Sat, 10 Jan 2026 18:33:59 -0500
From: "Chuck Lever"
To: NeilBrown
Cc: "Jeff Layton", "Olga Kornievskaia", "Dai Ngo", "Tom Talpey",
 "Mike Snitzer", "Christoph Hellwig", linux-nfs@vger.kernel.org,
 "Chuck Lever"
Message-Id:
In-Reply-To: <176808109160.2462021.5788018456330144196@noble.neil.brown.name>
References: <20260109215613.25250-1-cel@kernel.org>
 <176802301025.16766.5819430775313248993@noble.neil.brown.name>
 <63566a53-ed5a-4c0b-920d-22219c750354@app.fastmail.com>
 <176808109160.2462021.5788018456330144196@noble.neil.brown.name>
Subject: Re: [RFC PATCH v2] NFSD: Add asynchronous write throttling support
 for UNSTABLE WRITEs
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

On Sat, Jan 10, 2026, at 4:38 PM, NeilBrown wrote:
> On Sun, 11 Jan 2026, Chuck Lever wrote:
>>
>> On Sat, Jan 10, 2026, at 12:30 AM, NeilBrown wrote:
>> > On Sat, 10 Jan 2026, Chuck Lever wrote:
>> >> From: Chuck Lever
>> >>
>> >> When memory pressure occurs during buffered writes, the traditional
>> >> approach is for balance_dirty_pages() to put the writing thread to
>> >> sleep until dirty pages are flushed. For NFSD, this means server
>> >> threads block waiting for I/O, reducing overall server throughput.
>> >>
>> >> Add asynchronous write throttling for UNSTABLE writes using the
>> >> BDP_ASYNC flag to balance_dirty_pages_ratelimited_flags(). NFSD
>> >> checks memory pressure before attempting a buffered write. If the
>> >> call returns -EAGAIN (indicating memory exhaustion), NFSD returns
>> >> NFS4ERR_DELAY (or NFSERR_JUKEBOX for NFSv3) to the client instead
>> >> of blocking.
>> >>
>> >> Clients then wait and retry, rather than tying up server memory with
>> >> a cached uncommitted write payload.
>> >>
>> >> Signed-off-by: Chuck Lever
>> >> ---
>> >>  fs/nfsd/vfs.c | 24 ++++++++++++++++++++++++
>> >>  1 file changed, 24 insertions(+)
>> >>
>> >> Compile tested only.
>> >>
>> >> Changes since RFC v1:
>> >> - Remove the experimental debugfs setting
>> >> - Enforce throttling specifically only for UNSTABLE WRITEs
>> >>
>> >>
>> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> >> index 168d3ccc8155..c4550105234e 100644
>> >> --- a/fs/nfsd/vfs.c
>> >> +++ b/fs/nfsd/vfs.c
>> >> @@ -1458,6 +1458,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> >>  		}
>> >>  	}
>> >>
>> >> +	/*
>> >> +	 * Throttle buffered writes under memory pressure. When dirty
>> >> +	 * page limits are exceeded, BDP_ASYNC causes -EAGAIN to be
>> >> +	 * returned rather than blocking the thread. This -EAGAIN
>> >> +	 * maps to nfserr_jukebox, signaling the client to back off
>> >> +	 * and retry rather than tying up a server thread during
>> >> +	 * writeback.
>> >> +	 *
>> >> +	 * NFSv2 writes commit to stable storage before reply; no
>> >> +	 * dirty pages accumulate, so throttling is unnecessary.
>> >> +	 * FILE_SYNC and DATA_SYNC writes flush immediately and do
>> >> +	 * not leave uncommitted dirty pages behind.
>> >> +	 * Direct I/O and DONTCACHE bypass the page cache entirely.
>> >> +	 */
>> >> +	if (rqstp->rq_vers > 2 &&
>> >> +	    stable == NFS_UNSTABLE &&
>> >> +	    nfsd_io_cache_write == NFSD_IO_BUFFERED) {
>> >> +		host_err =
>> >> +			balance_dirty_pages_ratelimited_flags(file->f_mapping,
>> >> +							      BDP_ASYNC);
>> >> +		if (host_err == -EAGAIN)
>> >> +			goto out_nfserr;
>> >
>> > I doubt that this will do what you want - at least not reliably.
>> >
>> > balance_dirty_pages_ratelimited_flags() assumes it will be called
>> > repeatedly by the same task and it lets that task write for a while,
>> > then blocks it, then lets it write some more.
>> >
>> > The way you have integrated it into nfsd could result in the write load
>> > bouncing around among different threads and behaving inconsistently.
>> >
>> > Also the delay imposed is (for a Linux client) between 100ms and
>> > 15seconds.
>> > I suspect that is often longer than we would want.
>> > The actual pause imposed by page-writeback.c is variable based on the
>> > measured throughput of the backing device.
>>
>> These are UNSTABLE WRITEs. I can understand delaying the COMMIT because
>> that's where NFSD requests synchronous interaction with the backing
>> device. But nothing delays an UNSTABLE WRITE if the backing device is
>> slow.
>
> That isn't correct. If the "dirty threshold" is reached (e.g. 10% of
> memory dirty) then balance_dirty_pages() will delay all writes to avoid
> exceeding the dirty page limit.

That doesn't seem to be happening in some cases. Or perhaps, it is
happening, but the added delay is not aggressive enough.


>> > But maybe I'm seeing problems that don't exist. Testing would help, but
>> > finding a mix of loads that properly stress the system would be a
>> > challenge.
>> >
>> > And maybe just allowing the thread pool to grow will make this a
>> > non-problem?
>>
>> I think allowing the thread pool to grow could make the memory problem
>> worse.
>
> At 4(?) pages per thread?

I'm talking about the WRITE payloads, not the thread footprint. More
threads means capacity to handle a higher rate of ingress UNSTABLE
WRITE traffic. I think we need a way for NFSD to complete those
requests quickly (with NFS4ERR_DELAY, for example) when the server is
under duress so that WRITE payloads pending on the transport queue or
waiting to be committed do not consume server memory until the server
has the resources to process the WRITEs. Flow control, essentially.


> What exactly is "the memory problem"? Do you have specific symptoms you
> are trying to address? Have you had NFS server run out of memory and
> grind to a halt?

Review the past 9 months of Mike's work on direct I/O, published on
this mailing list. Hammerspace has measured this misbehavior and
experienced server melt-down. Their solution is to avoid using the
page cache entirely.
But even so there still seems to be an effective denial-of-service
vector by overloading NFSD with UNSTABLE WRITE traffic faster than it
can push it to persistence.

Perhaps we need better observability first.


--
Chuck Lever