* [RFC PATCH] NFSD: Add a "file_sync" export option
@ 2025-10-30 12:56 Chuck Lever
2025-10-30 14:20 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2025-10-30 12:56 UTC (permalink / raw)
To: linux-nfs; +Cc: Mike Snitzer, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Introduce the kernel pieces for a "file_sync" export option. This
option would make all NFS WRITE operations on one export a
FILE_SYNC WRITE.
There are two primary use cases for this new export option:
1. The exported file system is not backed by persistent storage.
Thus a subsequent COMMIT will be a no-op. To prevent the client
from wasting the extra round-trip on a COMMIT operation, convert
all WRITEs to files on that export to FILE_SYNC.
2. The exported file system is backed by persistent storage that is
faster than the mean network round trip with the client. Waiting
for a separate COMMIT operation would cost more time than just
committing the data during the WRITE operation.
Either the underlying persistent storage is faster than most any
network fabric (eg, NVMe); or the network connection to the
client is very high latency (eg, a WAN link).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/export.c | 1 +
fs/nfsd/nfs4proc.c | 1 +
fs/nfsd/vfs.c | 5 +++--
include/uapi/linux/nfsd/export.h | 3 ++-
4 files changed, 7 insertions(+), 3 deletions(-)
This patch is a year old, so won't apply to current kernels. But
the idea is similar to Mike's suggestion that NFSD_IO_DIRECT
should promote all NFS WRITEs to durable writes, but is much
simpler in execution. Any interest in revisiting this approach?
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index c82d8e3e0d4f..11b5337dd0ea 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -1297,6 +1297,7 @@ static struct flags {
{ NFSEXP_V4ROOT, {"v4root", ""}},
{ NFSEXP_PNFS, {"pnfs", ""}},
{ NFSEXP_SECURITY_LABEL, {"security_label", ""}},
+ { NFSEXP_FILE_SYNC, {"file_sync", ""}},
{ 0, {"", ""}}
};
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 51bae11d5d23..7a4ded3ff7c2 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1269,6 +1269,7 @@ nfsd4_clone(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
status = nfsd4_clone_file_range(rqstp, src, clone->cl_src_pos,
dst, clone->cl_dst_pos, clone->cl_count,
EX_ISSYNC(cstate->current_fh.fh_export));
+ /* cel: check the "file_sync" export option as well */
nfsd_file_put(dst);
nfsd_file_put(src);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index cd00d95c997f..ffa6db6851bd 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1205,9 +1205,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
exp = fhp->fh_export;
- if (!EX_ISSYNC(exp))
+ if (exp->ex_flags & NFSEXP_FILE_SYNC)
+ *stable = NFS_FILE_SYNC;
+ else if (!EX_ISSYNC(exp))
*stable = NFS_UNSTABLE;
-
if (*stable && !fhp->fh_use_wgather)
flags |= RWF_SYNC;
diff --git a/include/uapi/linux/nfsd/export.h b/include/uapi/linux/nfsd/export.h
index a73ca3703abb..45afec454a37 100644
--- a/include/uapi/linux/nfsd/export.h
+++ b/include/uapi/linux/nfsd/export.h
@@ -53,9 +53,10 @@
*/
#define NFSEXP_V4ROOT 0x10000
#define NFSEXP_PNFS 0x20000
+#define NFSEXP_FILE_SYNC 0x40000
/* All flags that we claim to support. (Note we don't support NOACL.) */
-#define NFSEXP_ALLFLAGS 0x3FEFF
+#define NFSEXP_ALLFLAGS 0x7FEFF
/* The flags that may vary depending on security flavor: */
#define NFSEXP_SECINFO_FLAGS (NFSEXP_READONLY | NFSEXP_ROOTSQUASH \
--
2.51.0
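
For readers following along, here is a minimal userspace sketch of the decision the vfs.c hunk makes, plus a compile-time check of the NFSEXP_ALLFLAGS arithmetic from the export.h hunk. It is a model, not the kernel code: struct export_model and effective_stable() are hypothetical names, and the NFSEXP_ASYNC value and the EX_ISSYNC() definition are assumptions, since the diff does not show them.

```c
#include <assert.h>

/* Values copied from the export.h hunk above. */
#define NFSEXP_FILE_SYNC	0x40000
#define NFSEXP_ALLFLAGS_OLD	0x3FEFF
#define NFSEXP_ALLFLAGS_NEW	0x7FEFF

/* Assumption: not shown in the diff; believed to match the uapi header. */
#define NFSEXP_ASYNC		0x0010

/* Adding the new bit to the old mask must yield the new mask. */
static_assert((NFSEXP_ALLFLAGS_OLD | NFSEXP_FILE_SYNC) == NFSEXP_ALLFLAGS_NEW,
	      "NFSEXP_ALLFLAGS must grow by exactly NFSEXP_FILE_SYNC");

/* stable_how values as defined by the NFSv3 protocol (RFC 1813). */
enum stable_how { NFS_UNSTABLE = 0, NFS_DATA_SYNC = 1, NFS_FILE_SYNC = 2 };

/* Hypothetical stand-in for struct svc_export; only ex_flags is modeled. */
struct export_model {
	unsigned int ex_flags;
};

/* Assumption: EX_ISSYNC() is true when the export is not marked "async". */
#define EX_ISSYNC(exp)	(!((exp)->ex_flags & NFSEXP_ASYNC))

/* Same branch order as the patched nfsd_vfs_write() hunk. */
static enum stable_how effective_stable(const struct export_model *exp,
					enum stable_how requested)
{
	if (exp->ex_flags & NFSEXP_FILE_SYNC)
		return NFS_FILE_SYNC;	/* new: promote every WRITE */
	if (!EX_ISSYNC(exp))
		return NFS_UNSTABLE;	/* existing: "async" export */
	return requested;		/* otherwise honor the client */
}
```

Because the "file_sync" test comes first, an export carrying both "async" and "file_sync" ends up with FILE_SYNC in this model.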
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 12:56 [RFC PATCH] NFSD: Add a "file_sync" export option Chuck Lever
@ 2025-10-30 14:20 ` Christoph Hellwig
2025-10-30 15:33 ` Mike Snitzer
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-10-30 14:20 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Mike Snitzer, Chuck Lever
On Thu, Oct 30, 2025 at 08:56:38AM -0400, Chuck Lever wrote:
> Either the underlying persistent storage is faster than most any
> network fabric (eg, NVMe); or the network connection to the
> client is very high latency (eg, a WAN link).
NVMe really is an interface and not a description of performance
characteristics. Nevertheless none of the NVMe implementations I
know have lower roundtrip latency than modern local networks.
> This patch is a year old, so won't apply to current kernels. But
> the idea is similar to Mike's suggestion that NFSD_IO_DIRECT
> should promote all NFS WRITEs to durable writes, but is much
> simpler in execution. Any interest in revisiting this approach?
This is a much better approach than overloading direct I/O with
these semantics. I'd still love to see actual use cases for which
we see benefits before merging it.
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 14:20 ` Christoph Hellwig
@ 2025-10-30 15:33 ` Mike Snitzer
2025-10-30 15:45 ` Christoph Hellwig
2025-10-30 15:47 ` Chuck Lever
0 siblings, 2 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-10-30 15:33 UTC (permalink / raw)
To: Christoph Hellwig, Chuck Lever; +Cc: linux-nfs, Chuck Lever
On Thu, Oct 30, 2025 at 07:20:02AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 30, 2025 at 08:56:38AM -0400, Chuck Lever wrote:
> > Either the underlying persistent storage is faster than most any
> > network fabric (eg, NVMe); or the network connection to the
> > client is very high latency (eg, a WAN link).
>
> NVMe really is an interface and not a description of performance
> characteristics. Nevertheless none of the NVMe implementations I
> know have lower roundtrip latency than modern local networks.
Sure, but not all modern networks have the same level of performance
either. When the NVMe is faster than the network we don't see nearly
as much MM pressure. But that implies the network is the bottleneck, so
reducing network operations (like COMMIT) should reduce network
traffic (even if marginally).
Once the network is as fast or faster than the NVMe devices, that's
when we've seen VM writeback/reclaim with buffered IO become
detrimental (when the working set exceeds system memory by a factor of
3:1). And that's where NFSD_IO_DIRECT mode has proven best.
> > This patch is a year old, so won't apply to current kernels. But
> > the idea is similar to Mike's suggestion that NFSD_IO_DIRECT
> > should promote all NFS WRITEs to durable writes, but is much
> > simpler in execution. Any interest in revisiting this approach?
>
> This is a much better approach than overloading direct I/O with
> these semantics. I'd still love to see actual use cases for which
> we see benefits before merging it.
Yes. Also thinking that a "data_sync" export option would be
appropriate too (that way we have the ability to try all stable_how
variants). Chuck? If something like that sounds OK in theory I can
rebase your patch (still attributed to you) and then create a separate
patch to add "data_sync" and then work to get the permutations tested.
Or the export option could be stable_how=[unstable,data_sync,file_sync] ?
(knowing that this export option would just set the stable_how floor,
NFSD cannot downgrade NFS client specified stable_how as per the spec
Chuck pointed to before).
(but that ushers in collision with async and stable_how=, so maybe
best to just go with discrete export options? or confine stable_how=
to being able to set data_sync or file_sync?)
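
The "floor" idea above can be sketched as taking the stronger of the client's requested stable_how and the export's configured minimum. This is only an illustration under stated assumptions: apply_stable_floor() is a hypothetical helper, not an existing NFSD function, and the interaction with "async" raised above is deliberately left out.

```c
#include <assert.h>

/* stable_how values as defined by the NFSv3 protocol (RFC 1813). */
enum stable_how { NFS_UNSTABLE = 0, NFS_DATA_SYNC = 1, NFS_FILE_SYNC = 2 };

/*
 * Hypothetical helper: combine the client's requested stable_how with a
 * per-export floor. The result is never weaker than what the client
 * asked for, and never weaker than the floor.
 */
static enum stable_how apply_stable_floor(enum stable_how requested,
					  enum stable_how floor)
{
	assert(requested <= NFS_FILE_SYNC && floor <= NFS_FILE_SYNC);
	return requested > floor ? requested : floor;
}
```

With a floor of NFS_UNSTABLE the helper is a no-op, which would correspond to stable_how=unstable in the option syntax floated above.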
Christoph, if you have canned benchmarks that do a solid job of
showcasing overwrites (which you expect to really benefit from _not_
having DSYNC or DSYNC|SYNC set) please let me know.
Thanks,
Mike
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 15:33 ` Mike Snitzer
@ 2025-10-30 15:45 ` Christoph Hellwig
2025-10-30 16:16 ` Mike Snitzer
2025-10-30 15:47 ` Chuck Lever
1 sibling, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-10-30 15:45 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Christoph Hellwig, Chuck Lever, linux-nfs, Chuck Lever
On Thu, Oct 30, 2025 at 11:33:00AM -0400, Mike Snitzer wrote:
> Sure, but not all modern networks have the same level of performance
> either. When the NVMe is faster than the network we don't see nearly
> as much MM pressure. But that implies the network is the bottleneck, so
> reducing network operations (like COMMIT) should reduce network
> traffic (even if marginally).
There is a lot of code between the network and the storage, and they
tend to be slower than either for many common workloads :)
> Once the network is as fast or faster than the NVMe devices, that's
> when we've seen VM writeback/reclaim with buffered IO become
> detrimental (when the working set exceeds system memory by a factor of
> 3:1). And that's where NFSD_IO_DIRECT mode has proven best.
I bet that getting VM writeback out of the stack helps a lot. But as
mentioned I doubt forcing stable writes helps, and in fact for most
workloads will actually make it slower. But that's just my experience
from similar but not the same things, so I'd love to see numbers if
you suspect something else. Either way we're much better off changing
one variable at a time instead of forcing two totally unrelated changes
to go together.
> Christoph, if you have canned benchmarks that do a solid job of
> showcasing overwrites (which you expect to really benefit from _not_
> having DSYNC or DSYNC|SYNC set) please let me know.
None with nfs in the loop. For an older benchmark with purely
local I/O, 3460cac1ca76215a60acb086ebe97b3e50731628 has an example,
which should be pretty representative for modern workloads, even if
the overall numbers for each case would improve a lot.
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 15:33 ` Mike Snitzer
2025-10-30 15:45 ` Christoph Hellwig
@ 2025-10-30 15:47 ` Chuck Lever
2025-10-30 16:32 ` Mike Snitzer
1 sibling, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2025-10-30 15:47 UTC (permalink / raw)
To: Mike Snitzer, Christoph Hellwig; +Cc: linux-nfs, Chuck Lever
On 10/30/25 11:33 AM, Mike Snitzer wrote:
>>> This patch is a year old, so won't apply to current kernels. But
>>> the idea is similar to Mike's suggestion that NFSD_IO_DIRECT
>>> should promote all NFS WRITEs to durable writes, but is much
>>> simpler in execution. Any interest in revisiting this approach?
>> This is a much better approach than overloading direct I/O with
>> these semantics. I'd still love to see actual use cases for which
>> we see benefits before merging it.
And the reason it hasn't been merged yet is because I couldn't find any
such workloads. Even tmpfs was a little slower without the COMMITs,
to my surprise.
> Yes. Also thinking that a "data_sync" export option would be
> appropriate too (that way we have the ability to try all stable_how
> variants). Chuck? If something like that sounds OK in theory I can
> rebase your patch (still attributed to you) and then create a separate
> patch to add "data_sync" and then work to get the permutations tested.
If you want to experiment, feel free.
As always, I'm not enthusiastic about exposing a bunch of tuning knobs
like this without a clear understanding of how it benefits users and
what documentation might look like explaining how to use it. So for the
moment, this patch is, as labeled in the Subject: field, an RFC, and not
a firm/official proposal for an API change. (Note that IIRC, adding the
new export option was an idea we had /before/ we had
/sys/kernel/debug/nfsd available to us).
Or to put it differently, just because I proposed this patch does not
mean it's automatically "Chuck approved". I'm interested in experimental
results first. I'm thinking you have access to big iron on which to try
it.
But, in the bigger picture, I think comparison between this approach
and NFSD_IO_DIRECT might be illustrative.
--
Chuck Lever
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 15:45 ` Christoph Hellwig
@ 2025-10-30 16:16 ` Mike Snitzer
0 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-10-30 16:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chuck Lever, linux-nfs, Chuck Lever
On Thu, Oct 30, 2025 at 08:45:42AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 30, 2025 at 11:33:00AM -0400, Mike Snitzer wrote:
> > Sure, but not all modern networks have the same level of performance
> > either. When the NVMe is faster than the network we don't see nearly
> > as much MM pressure. But that implies the network is the bottleneck, so
> > reducing network operations (like COMMIT) should reduce network
> > traffic (even if marginally).
>
> There is a lot of code between the network and the storage, and they
> tend to be slower than either for many common workloads :)
I've been pretty impressed with how NFS, and surrounding Linux IO
stacks (network and storage), is able to keep up with really fast
hardware.
> > Once the network is as fast or faster than the NVMe devices, that's
> > when we've seen VM writeback/reclaim with buffered IO become
> > detrimental (when the working set exceeds system memory by a factor of
> > 3:1). And that's where NFSD_IO_DIRECT mode has proven best.
>
> I bet that getting VM writeback out of the stack helps a lot. But as
> mentioned I doubt forcing stable writes helps, and in fact for most
> workloads will actually make it slower. But that's just my experience
> from similar but not the same things, so I'd love to see numbers if
> you suspect something else. Either way we're much better off changing
> one variable at a time instead of forcing two totally unrelated changes
> to go together.
Yeah. I'd have split them out to new variants of NFSD_IO_DIRECT,
e.g.:
NFSD_IO_DIRECT_DATA_SYNC
NFSD_IO_DIRECT_FILE_SYNC
But using a proper export option to control stable_how entirely
independent of the chosen NFSD_IO mode is more useful.
> > Christoph, if you have canned benchmarks that do a solid job of
> > showcasing overwrites (which you expect to really benefit from _not_
> > having DSYNC or DSYNC|SYNC set) please let me know.
>
> None with nfs in the loop. For an older benchmark with purely
> local I/O, 3460cac1ca76215a60acb086ebe97b3e50731628 has an example,
> which should be pretty representative for modern workloads, even if
> the overall numbers for each case would improve a lot.
Even if not for NFS, we can run the NFS client using O_DIRECT which
should drive the IO so NFSD receives it much like the application
issued it (albeit wrapped in XDR and NFS protocol). And if using
NFSD_IO_DIRECT we should then be able to assess how/if changing
stable_how impacts performance.
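
As a rough illustration of the kind of client-side load generator this implies, the sketch below overwrites one aligned block through O_DIRECT. It is not a benchmark from this thread: direct_overwrite() and the 4096-byte IO_BLOCK are assumptions made for the example, and the path would point at a file on the NFS mount under test.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0		/* fallback: degrades to a buffered write */
#endif

#define IO_BLOCK 4096		/* assumed to satisfy O_DIRECT alignment */

/*
 * Overwrite one aligned block of "path" at "offset" using O_DIRECT.
 * Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t direct_overwrite(const char *path, off_t offset, int fill)
{
	void *buf = NULL;
	ssize_t written;
	int fd, saved;

	/* O_DIRECT generally needs an aligned buffer, length, and offset. */
	assert(offset % IO_BLOCK == 0);
	if (posix_memalign(&buf, IO_BLOCK, IO_BLOCK) != 0) {
		errno = ENOMEM;
		return -1;
	}
	memset(buf, fill, IO_BLOCK);

	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		saved = errno;
		free(buf);
		errno = saved;
		return -1;
	}

	written = pwrite(fd, buf, IO_BLOCK, offset);
	saved = errno;		/* preserve pwrite()'s errno across cleanup */
	close(fd);
	free(buf);
	errno = saved;
	return written;
}
```

Some file systems reject O_DIRECT with EINVAL, so a caller should treat that error as "unsupported here" rather than as a failed write.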
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 15:47 ` Chuck Lever
@ 2025-10-30 16:32 ` Mike Snitzer
2025-10-30 19:13 ` Chuck Lever
0 siblings, 1 reply; 9+ messages in thread
From: Mike Snitzer @ 2025-10-30 16:32 UTC (permalink / raw)
To: Chuck Lever; +Cc: Christoph Hellwig, linux-nfs, Chuck Lever
On Thu, Oct 30, 2025 at 11:47:15AM -0400, Chuck Lever wrote:
> On 10/30/25 11:33 AM, Mike Snitzer wrote:
> >>> This patch is a year old, so won't apply to current kernels. But
> >>> the idea is similar to Mike's suggestion that NFSD_IO_DIRECT
> >>> should promote all NFS WRITEs to durable writes, but is much
> >>> simpler in execution. Any interest in revisiting this approach?
> >> This is a much better approach than overloading direct I/O with
> >> these semantics. I'd still love to see actual use cases for which
> >> we see benefits before merging it.
>
> And the reason it hasn't been merged yet is because I couldn't find any
> such workloads. Even tmpfs was a little slower without the COMMITs,
> to my surprise.
>
>
> > Yes. Also thinking that a "data_sync" export option would be
> > appropriate too (that way we have the ability to try all stable_how
> > variants). Chuck? If something like that sounds OK in theory I can
> > rebase your patch (still attributed to you) and then create a separate
> > patch to add "data_sync" and then work to get the permutations tested.
>
> If you want to experiment, feel free.
>
> As always, I'm not enthusiastic about exposing a bunch of tuning knobs
> like this without a clear understanding of how it benefits users and
> what documentation might look like explaining how to use it. So for the
> moment, this patch is, as labeled in the Subject: field, an RFC, and not
> a firm/official proposal for an API change. (Note that IIRC, adding the
> new export option was an idea we had /before/ we had
> /sys/kernel/debug/nfsd available to us).
>
> Or to put it differently, just because I proposed this patch does not
> mean it's automatically "Chuck approved". I'm interested in experimental
> results first. I'm thinking you have access to big iron on which to try
> it.
>
> But, in the bigger picture, I think comparison between this approach
> and NFSD_IO_DIRECT might be illustrative.
Sure, I'm very interested in the data myself. A patch to easily
enable control is all I'm after. So given what you said above, I'll
actually just run with introducing 2 new variants of NFSD_IO_DIRECT
for now, so like I mentioned in my previous reply to hch:
NFSD_IO_DIRECT_DATA_SYNC
NFSD_IO_DIRECT_FILE_SYNC
Because it sounds like it is only in the context of NFSD_IO_DIRECT
where there is any doubt about whether using NFS_FILE_SYNC is helpful.
So it bounds the supportability exposure, and makes it clear these
knobs are for experimental purposes relative to NFSD_IO mode controls.
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 16:32 ` Mike Snitzer
@ 2025-10-30 19:13 ` Chuck Lever
2025-10-30 23:39 ` Mike Snitzer
0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2025-10-30 19:13 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Christoph Hellwig, linux-nfs, Chuck Lever
On 10/30/25 12:32 PM, Mike Snitzer wrote:
> On Thu, Oct 30, 2025 at 11:47:15AM -0400, Chuck Lever wrote:
>> But, in the bigger picture, I think comparison between this approach
>> and NFSD_IO_DIRECT might be illustrative.
>
> Sure, I'm very interested in the data myself. A patch to easily
> enable control is all I'm after. So given what you said above, I'll
> actually just run with introducing 2 new variants of NFSD_IO_DIRECT
> for now, so like I mentioned in my previous reply to hch:
>
> NFSD_IO_DIRECT_DATA_SYNC
> NFSD_IO_DIRECT_FILE_SYNC
>
> Because it sounds like it is only in the context of NFSD_IO_DIRECT
> where there is any doubt about whether using NFS_FILE_SYNC is helpful.
I'm not sure where you're getting that. FILE_SYNC is interesting for all
the IO modes, and I'd really like to see specifically the comparison of
NFSD_IO_BUFFERED with "file_sync" and NFSD_IO_DIRECT with and without
"file_sync".
> So it bounds the supportability exposure, and makes it clear these
> knobs are for experimental purposes relative to NFSD_IO mode controls.
--
Chuck Lever
* Re: [RFC PATCH] NFSD: Add a "file_sync" export option
2025-10-30 19:13 ` Chuck Lever
@ 2025-10-30 23:39 ` Mike Snitzer
0 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-10-30 23:39 UTC (permalink / raw)
To: Chuck Lever; +Cc: Christoph Hellwig, linux-nfs, Chuck Lever
On Thu, Oct 30, 2025 at 03:13:05PM -0400, Chuck Lever wrote:
> On 10/30/25 12:32 PM, Mike Snitzer wrote:
> > On Thu, Oct 30, 2025 at 11:47:15AM -0400, Chuck Lever wrote:
>
> >> But, in the bigger picture, I think comparison between this approach
> >> and NFSD_IO_DIRECT might be illustrative.
> >
> > Sure, I'm very interested in the data myself. A patch to easily
> > enable control is all I'm after. So given what you said above, I'll
> > actually just run with introducing 2 new variants of NFSD_IO_DIRECT
> > for now, so like I mentioned in my previous reply to hch:
> >
> > NFSD_IO_DIRECT_DATA_SYNC
> > NFSD_IO_DIRECT_FILE_SYNC
> >
> > Because it sounds like it is only in the context of NFSD_IO_DIRECT
> > where there is any doubt about whether using NFS_FILE_SYNC is helpful.
>
> I'm not sure where you're getting that. FILE_SYNC is interesting for all
> the IO modes, and I'd really like to see specifically the comparison of
> NFSD_IO_BUFFERED with "file_sync" and NFSD_IO_DIRECT with and without
> "file_sync".
I wasn't talking about file_sync being useful or not in general. I
meant that, based on what you said about file_sync uniformly hurting
performance (even with tmpfs), it wouldn't be a question for IO
modes other than NFSD_IO_DIRECT.
Anyway, I can change to implementing the control in terms of an export
option.
Mike