All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	NeilBrown <neil@brown.name>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
Date: Mon, 27 Oct 2025 23:26:58 -0400	[thread overview]
Message-ID: <aQA4AkzjlDybKzCR@kernel.org> (raw)
In-Reply-To: <a221755a-0868-477d-b978-d2c045011977@kernel.org>

[Apologies, I think I repeated myself 4 slightly different ways below]

On Mon, Oct 27, 2025 at 01:57:25PM -0400, Chuck Lever wrote:
> On 10/27/25 12:05 PM, Mike Snitzer wrote:
> >>> You also need IOCB_DSYNC for direct I/O to hit the media if you want
> >>> to return NFS_FILE_SYNC.  But I still don't understand why we want or
> >>> need to return NFS_FILE_SYNC to start with.
> >> NFS_FILE_SYNC is not required here, but it's better if we can return
> >> that. If the server returns NFS_FILE_SYNC there is no need for the
> >> client to follow up with a COMMIT.
> > Yes, which is why I'm confused by Chuck wanting to do away with
> > NFSD_IO_DIRECT setting NFS_FILE_SYNC "for now".  Not heard compelling
> > reason, but "it is what it is". 😉
> 
> The compelling reason is that it's generally faster (or less work for
> the NFS server and its storage) to sync the metadata after the client
> has sent all of the data it wants to write. This amortizes the cost of
> the metadata operations, and allows the server to get the written data
> persisted (if it makes sense to do that) while waiting for the COMMIT.

But none of that matters if the only safe way to implement mixing
buffered and direct IO is by waiting for the DIO to succeed and with
it any associated page cache invalidation (and associated possible
failure to invalidate the page handled with using buffered IO fallback
by underlying filesystem).

Any buffered or direct IO associated with the misaligned DIO WRITE
handling, in terms of 3 segments, must use IOCB_DSYNC.

So can we please revisit your desire to eliminate the use of
IOCB_DSYNC for NFSD_IO_DIRECT WRITEs?

I contend that NFSD_IO_DIRECT should use IOCB_DSYNC|IOCBD_SYNC for
all 3 segments of a misaligned DIO WRITE (so for both buffered and
direct IO).

> Since your patch asserts IOCB_DSYNC for all direct write segments,
> NFSD_IO_DIRECT (as it is implemented in your patch) does not need a
> subsequent COMMIT operation. Forcing the client to COMMIT is totally
> unnecessary. That's why I suggested promoting all NFSD_IO_DIRECT WRITEs
> to FILE_SYNC.

No, the COMMIT can only be elided (and NFS_FILE_SYNC return to client)
if both IOCB_DSYNC and IOCB_SYNC are set.

But yes, at Bakeathon, the intent/understanding was: if you're already
setting SYNC then you can avoid the COMMIT -- this nuance of DSYNC vs
SYNC wasn't on our radar at the time.

> So perhaps the issue here is that the rationale for using IOCB_DSYNC
> for all NFSD_IO_DIRECT writes is hazy and needs to be clarified.

How is it still hazy?  We've had repeat discussion about the need for
IOCB_DSYNC (and IOCB_SYNC if we really want to honor intent of
NFS_FILE_SYNC).

Christoph has repeatedly said DSYNC is needed with O_DIRECT, yet you
keep removing it.

> > Were we not all concerned about safety first (especially of mixing
> > buffered and DIO) and performance a secondary concern?  Using
> > IOCB_DSYNC|IOCB_SYNC for all WRITEs and returning NFS_FILE_SYNC is
> > really safe right?
> 
> The client ensures that UNSTABLE + COMMIT is just as "safe" as
> FILE_SYNC, generally, by preserving the dirty data in its own page cache
> until the server indicates the written data is durable and is safe to
> evict if needed.

I'm aware UNSTABLE is safe thanks to COMMIT, etc.

But the entire intent behind NFSD's O_DIRECT support is to ensure IO
is on stable storage when it replies to the client.  The client isn't
meant to get involved with driving the correctness of NFSD's O_DIRECT
support (by requiring the client set NFS_FILE_SYNC for the benefit of
a feature it doesn't know enabled in the server).

That cannot be what you're saying, NFSD_IO_DIRECT is entirely managed
by the server as a configurable implementation detail.  So we need to
make sure it is implemented safely without the client being involved.

> If you want to make an argument about data integrity, let's
> be as precise as we can about what we believe might be unsafe. The
> data integrity doubt was with the unaligned ends, IIRC? If we need
> that extra bit of integrity, then WRITEs with an unaligned portion
> will need to be IOCB_DSYNC, and then promoted.

The repeat _valid_ concern from Jeff and you was about the need to
ensure data integrity in the NFSD_IO_DIRECT's misaligned WRITE support
(mixes both buffered IO and direct IO for a single misaligned WRITE).

But now you no longer have that concern and have removed the
IOCB_DSYNC flag which is required for _both_ buffered and direct IO.

(IOCB_SYNC is only required if we'd like to return NFS_FILE_SYNC to
the client, maybe you meant to leave IOCB_DSYNC but remove IOCB_SYNC?
Oh man do I really hope so... ;) If so, please trim from here down)

> An NFS WRITE, even an UNSTABLE one, I believe makes the written data
> visible to other readers. That might be an argument for using IOCB_DSYNC
> with IOCB_DIRECT, so that subsequent NFS READs (and local applications)
> see the written data as soon as NFSD generates a response to an NFS
> WRITE backed by IOCB_DIRECT.

I'm confused, how is using IOCB_DSYNC with IOCB_DIRECT up for question
again?  It is needed for misaligned DIO WRITE (splitting into 3
segments, mixing buffered and direct IO).

From the start, NFSD_IO_DIRECT's WRITE support has been about ensuring
cache coherence on a per WRITE basis (once its acknowledged back to
the client). I have tried to be careful to do that even when handling
misaligned DIO WRITEs.

> I think we also discussed that an NFS WRITE should make updates visible
> in byte order, which is why the segments are handled low offset to high
> offset.

Trond mentioned it as needed relative to ensuring consistent file
offset.  I mentioned that is the case, and with NFSD we get that
simply by issuing the IO in series, in file offset order -- but for
NFS client's LOCALIO I _do_ issue out-of-order segment IOs (head/tail
buffered and then DIO aligned middle, but with care to preserve file
offset integrity even with partial writes of any of the 3 segments).

Maybe its fine to have a mode where NFSD allows O_DIRECT to be used
without setting DSYNC when using UNSTABLE, _but_ it really shouldn't
be the default.

> Or, we might decide that, no, NFS WRITE has no data visibility mandate;
> applications achieve data visibility explicitly using COMMIT and file
> locking, so none of this matters.

We don't have that freedom if we cannot preserve file offset integrity
(as would be the case if we removed IOCB_DSYNC when handling all 3
segments of a misaligned DIO WRITE).  Removing IOCB_DSYNC would
compromise misaligned DIO WRITE as implemented.

> > And we already showed that doing so really isn't slow.
> 
> Well we don't have a comparison with "IOCB_DIRECT without IOCB_DSYNC".
> That might be faster than what you tested? Plus I think your test was
> on esoteric enterprise NVMe devices, not on the significantly more
> commonly deployed SSD devices.

IOCB_DIRECT without IOCB_DSYNC isn't an option because we must ensure
the data is ondisk.

The only related bake-off would be:
1) IOCB_DIRECT | IOCB_DSYNC with UNSTABLE 
 vs
2) IOCB_DIRECT | IOCB_DSYNC | IOCBD_SYNC with NFS_FILE_SYNC

(but on esoteric enterprise storage: there is no difference)

This thread and evolution of the threads before it is jarring:

Just recently we were surprised to find IOCB_DIRECT needs IOCB_DSYNC
to ensure data is ondisk; that you (and I too) thought O_DIRECT would
imply that -- I raised concern given what I saw in the XFS performance
improvement patch that was focused on combining O_DIRECT and O_DSYNC.
Because I couldn't see how to avoid setting DSYNC, and we've since
learned DSYNC needed _to ensure data is ondisk_.

So I'm left confused... and I'm feeling sick and would like to get off
this merry-go-round now ;)

Thanks,
Mike

  reply	other threads:[~2025-10-28  3:26 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-24 15:21   ` Jeff Layton
2025-10-27  8:02   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-27  8:03   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-27  8:04   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 17:12   ` Mike Snitzer
2025-10-24 17:24     ` Chuck Lever
2025-10-26  0:03   ` kernel test robot
2025-10-26  1:16   ` kernel test robot
2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 15:23     ` Chuck Lever
2025-10-27  8:05   ` Christoph Hellwig
2025-10-27 13:23     ` Chuck Lever
2025-10-27 13:27       ` Christoph Hellwig
2025-10-27 14:31         ` Mike Snitzer
2025-10-27 14:36           ` Christoph Hellwig
2025-10-27 14:58             ` Mike Snitzer
2025-10-27 15:04               ` Chuck Lever
2025-10-27 15:19                 ` Mike Snitzer
2025-10-27 15:05               ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:08   ` Christoph Hellwig
2025-10-27 10:38     ` Jeff Layton
2025-10-27 10:40       ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:09   ` Christoph Hellwig
2025-10-27 13:25     ` Chuck Lever
2025-10-27 13:30       ` Christoph Hellwig
2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
2025-10-24 15:23   ` Jeff Layton
2025-10-24 17:16   ` Mike Snitzer
2025-10-24 17:22     ` Chuck Lever
2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-10-24 15:24   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
2025-10-24 15:26   ` Jeff Layton
2025-10-24 17:20   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
2025-10-24 15:30   ` Jeff Layton
2025-10-24 15:37     ` Chuck Lever
2025-10-24 17:57   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
2025-10-24 15:32   ` Jeff Layton
2025-10-24 18:01   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
2025-10-24 15:34   ` Jeff Layton
2025-10-24 18:13   ` Mike Snitzer
2025-10-24 19:34     ` Chuck Lever
2025-10-24 20:37       ` Mike Snitzer
2025-10-24 21:16         ` Chuck Lever
2025-10-24 23:56           ` Mike Snitzer
2025-10-27  8:15             ` Christoph Hellwig
2025-10-27 10:50               ` Jeff Layton
2025-10-27 10:55                 ` Christoph Hellwig
2025-10-27 13:48                 ` Chuck Lever
2025-10-27 13:49                   ` Christoph Hellwig
2025-10-27 16:18                   ` Mike Snitzer
2025-10-27 16:59                     ` Mike Snitzer
2025-10-29  7:20                     ` Christoph Hellwig
2025-10-27 16:05                 ` Mike Snitzer
2025-10-27 17:57                   ` Chuck Lever
2025-10-28  3:26                     ` Mike Snitzer [this message]
2025-10-28 15:37                       ` Chuck Lever
2025-10-28 16:04                         ` Mike Snitzer
2025-10-28 18:48                           ` Chuck Lever
2025-10-28 23:56                             ` Mike Snitzer
2025-10-29 15:22                               ` Chuck Lever
2025-10-29 16:54                                 ` Mike Snitzer
2025-10-29  7:37                         ` Christoph Hellwig
2025-10-29  7:32                       ` Christoph Hellwig
2025-10-29  7:25                     ` Christoph Hellwig
2025-10-27  8:14         ` Christoph Hellwig
2025-10-27  8:12       ` Christoph Hellwig
2025-10-27 13:27         ` Chuck Lever
2025-10-27 13:30           ` Chuck Lever
2025-10-27 13:31             ` Christoph Hellwig
2025-10-27 14:11         ` Chuck Lever
2025-10-27 14:45           ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQA4AkzjlDybKzCR@kernel.org \
    --to=snitzer@kernel.org \
    --cc=cel@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=dai.ngo@oracle.com \
    --cc=hch@infradead.org \
    --cc=hch@lst.de \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=okorniev@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.