Date: Tue, 23 Sep 2025 19:31:43 -0400
From: Mike Snitzer
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs@vger.kernel.org, Chuck Lever
Subject: Re: [PATCH v3 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
References: <20250922141137.632525-1-cel@kernel.org>
	<20250922141137.632525-4-cel@kernel.org>
	<19eef754-57d9-4fe4-a6e6-a481dcec470e@kernel.org>
In-Reply-To: <19eef754-57d9-4fe4-a6e6-a481dcec470e@kernel.org>

On Tue, Sep 23, 2025 at 06:49:35PM -0400, Chuck Lever wrote:
> On 9/23/25 3:26 PM, Mike Snitzer wrote:
> > On Mon, Sep 22, 2025 at 10:11:37AM -0400, Chuck Lever wrote:
> >> From: Chuck Lever
> >>
> >> Add an experimental option that forces NFS READ operations to use
> >> direct I/O instead of reading through the NFS server's page cache.
> >>
> >> There are already other layers of caching:
> >>  - The page cache on NFS clients
> >>  - The block device underlying the exported file system
> >>
> >> The server's page cache, in many cases, is unlikely to provide
> >> additional benefit. Some benchmarks have demonstrated that the
> >> server's page cache is actively detrimental for workloads whose
> >> working set is larger than the server's available physical memory.
> >>
> >> For instance, on small NFS servers, cached NFS file content can
> >> squeeze out local memory consumers. For large sequential workloads,
> >> an enormous amount of data flows into and out of the page cache
> >> and is consumed by NFS clients exactly once -- caching that data
> >> is expensive to do and totally valueless.
> >>
> >> For now this is a hidden option that can be enabled on test
> >> systems for benchmarking. In the longer term, this option might
> >> be enabled persistently or per-export. When the exported file
> >> system does not support direct I/O, NFSD falls back to using
> >> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> >>
> >> Suggested-by: Mike Snitzer
> >> Reviewed-by: Mike Snitzer
> >> Reviewed-by: Jeff Layton
> >> Signed-off-by: Chuck Lever
> >> ---
> >>  fs/nfsd/debugfs.c |  2 ++
> >>  fs/nfsd/nfsd.h    |  1 +
> >>  fs/nfsd/trace.h   |  1 +
> >>  fs/nfsd/vfs.c     | 82 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 86 insertions(+)
> >>
> >> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> >> index ed2b9e066206..00eb1ecef6ac 100644
> >> --- a/fs/nfsd/debugfs.c
> >> +++ b/fs/nfsd/debugfs.c
> >> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> >>   * Contents:
> >>   * %0: NFS READ will use buffered IO
> >>   * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> >> + * %2: NFS READ will use direct IO
> >>   *
> >>   * This setting takes immediate effect for all NFS versions,
> >>   * all exports, and in all NFSD net namespaces.
> >> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >>  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >>  		break;
> >>  	case NFSD_IO_DONTCACHE:
> >> +	case NFSD_IO_DIRECT:
> >>  		/*
> >>  		 * Must disable splice_read when enabling
> >>  		 * NFSD_IO_DONTCACHE.
> >> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> >> index ea87b42894dd..bdb60ee1f1a4 100644
> >> --- a/fs/nfsd/nfsd.h
> >> +++ b/fs/nfsd/nfsd.h
> >> @@ -157,6 +157,7 @@ enum {
> >>  	/* Any new NFSD_IO enum value must be added at the end */
> >>  	NFSD_IO_BUFFERED,
> >>  	NFSD_IO_DONTCACHE,
> >> +	NFSD_IO_DIRECT,
> >>  };
> >>
> >>  extern u64 nfsd_io_cache_read __read_mostly;
> >> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> >> index 6e2c8e2aab10..bfd41236aff2 100644
> >> --- a/fs/nfsd/trace.h
> >> +++ b/fs/nfsd/trace.h
> >> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> >>  DEFINE_NFSD_IO_EVENT(read_start);
> >>  DEFINE_NFSD_IO_EVENT(read_splice);
> >>  DEFINE_NFSD_IO_EVENT(read_vector);
> >> +DEFINE_NFSD_IO_EVENT(read_direct);
> >>  DEFINE_NFSD_IO_EVENT(read_io_done);
> >>  DEFINE_NFSD_IO_EVENT(read_done);
> >>  DEFINE_NFSD_IO_EVENT(write_start);
> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >> index 35880d3f1326..ddcd812f0761 100644
> >> --- a/fs/nfsd/vfs.c
> >> +++ b/fs/nfsd/vfs.c
> >> @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >>  }
> >>
> >> +/*
> >> + * The byte range of the client's READ request is expanded on both
> >> + * ends until it meets the underlying file system's direct I/O
> >> + * alignment requirements. After the internal read is complete, the
> >> + * byte range of the NFS READ payload is reduced to the byte range
> >> + * that was originally requested.
> >> + *
> >> + * Note that a direct read can be done only when the xdr_buf
> >> + * containing the NFS READ reply does not already have contents in
> >> + * its .pages array. This is due to potentially restrictive
> >> + * alignment requirements on the read buffer. When .page_len and
> >> + * @base are zero, the .pages array is guaranteed to be page-
> >> + * aligned.
> >> + */
> >
> > So this ^ comment (and the related conversation with Neil in a
> > different thread) says page_len should be 0 on entry to
> > nfsd_direct_read.
> >
> >> @@ -1106,6 +1182,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	switch (nfsd_io_cache_read) {
> >>  	case NFSD_IO_BUFFERED:
> >>  		break;
> >> +	case NFSD_IO_DIRECT:
> >> +		if (nf->nf_dio_read_offset_align &&
> >> +		    rqstp->rq_res.page_len && !base)
> >> +			return nfsd_direct_read(rqstp, fhp, nf, offset,
> >> +						count, eof);
> >> +		fallthrough;
> >
> > Yet the nfsd_iter_read is only calling nfsd_direct_read() if
> > rqstp->rq_res.page_len is not zero, shouldn't it be
> > !rqstp->rq_res.page_len ?
>
> Oops, yes. I did this work last week, while out of range of my lab.

Sure.

> > (testing confirms it should be !rqstp->rq_res.page_len)
> >
> > Hopefully with this fix you can have more confidence in staging this
> > in your nfsd-testing?
> I'm waiting only for Neil to send an R-b.

OK, makes sense. For some reason I thought you had that for patch 3.

Mike
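
For anyone following the thread without the full patch at hand, here is a
minimal userspace sketch of the two ideas discussed above: expanding a READ
byte range to a direct I/O alignment, and the eligibility check with the
corrected !page_len test. This is not the code from the patch. The names
dio_range, expand_range() and can_use_direct_read() are hypothetical, and the
sketch assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: the expanded range plus how to trim it afterwards. */
struct dio_range {
	uint64_t start;	/* aligned start offset of the internal read */
	uint64_t len;	/* aligned length of the internal read */
	uint64_t skip;	/* leading bytes to drop to reach the requested offset */
};

/*
 * Expand [offset, offset + count) on both ends to multiples of @align.
 * Assumption: @align is a non-zero power of two.
 */
static struct dio_range expand_range(uint64_t offset, uint64_t count,
				     uint64_t align)
{
	struct dio_range r;
	uint64_t end;

	assert(align && (align & (align - 1)) == 0);

	r.start = offset & ~(align - 1);			/* round down */
	end = (offset + count + align - 1) & ~(align - 1);	/* round up */
	r.len = end - r.start;
	r.skip = offset - r.start;
	return r;
}

/*
 * Mirrors the condition as corrected in this thread: direct read is
 * attempted only when an offset alignment is known and both page_len
 * and base are zero.
 */
static bool can_use_direct_read(uint32_t offset_align, uint32_t page_len,
				uint32_t base)
{
	return offset_align && !page_len && !base;
}
```

With offset 5000, count 3000 and a 4096-byte alignment, expand_range()
yields start 4096, len 4096 and skip 904, so the payload returned to the
client would be the 3000 bytes beginning 904 bytes into the aligned buffer.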