All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	NeilBrown <neil@brown.name>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
Date: Tue, 28 Oct 2025 12:04:28 -0400	[thread overview]
Message-ID: <aQDpjNnj47ISRItg@kernel.org> (raw)
In-Reply-To: <1f7b30d2-f806-400f-81d3-80b6c924c410@kernel.org>

On Tue, Oct 28, 2025 at 11:37:52AM -0400, Chuck Lever wrote:
> On 10/27/25 11:26 PM, Mike Snitzer wrote:
> > So can we please revisit your desire to eliminate the use of
> > IOCB_DSYNC for NFSD_IO_DIRECT WRITEs?
> 
> Let's have a little less breathless panic, please. The whole point of
> review is to revisit our decisions. And nothing I've done so far is
> written in stone... even after merge, we can still apply patches. If
> the consensus is we don't like v8 or some particular patch, I will
> rewrite or replace it, as I've already said.
> 
> You might view me as a whimsical and authoritarian maintainer, but
> actually, I am resisting the urge to merge a patch that the community
> (that is now responsible for free long-term support of the patch) hasn't
> yet fully learned about and carefully reviewed. I see my role as
> enforcing a consensus process, and both learning and consensus takes
> time.

Cool, thanks for level setting.  Helps!

> >> So perhaps the issue here is that the rationale for using IOCB_DSYNC
> >> for all NFSD_IO_DIRECT writes is hazy and needs to be clarified.
> 
> > How is it still hazy?  We've had repeat discussion about the need for
> > IOCB_DSYNC (and IOCB_SYNC if we really want to honor intent of
> > NFS_FILE_SYNC).
> 
> TL;DR: it's hazy for folks who were not in the room in Raleigh.
> 
> I don't see any comments that explain /why/ the unaligned ends need to
> be durable along with the middle segment. It appears to be assumed that
> everyone agrees it's necessary. Patch review has shown that is perhaps
> not a valid assumption.

How so?  I missed any patch review that called into question the need
to ensure all segments are on stable storage.  The only was to do that
for buffered and direct I is to set IOCB_DSYNC.
 
> So far it's been discussed verbally, but we really /really/ want to have
> this documented, because we're all going to forget the rationale in a
> few months.

But it isn't so complex.  DSYNC is needed if you want data to be
ondisk when the call returns.

> And please do not forget that this is open source code. The code has to
> be able to be modified by developers outside our community. Right now it
> isn't well enough documented for anyone who was outside of the room in
> Raleigh two weeks ago to understand WTF is going on. The point being
> that we cannot make final decisions in a closed room -- eventually they
> need to face the larger community.
> 
> If we need to insist that NFSD_IO_DIRECT mode is always going to be
> fully durable, that design choice needs to be explained in code comments
> that are very close to the code that implements it.
> 
> 
> > Christoph has repeatedly said DSYNC is needed with O_DIRECT, yet you
> > keep removing it.
> 
> That's not what I read. Over the course of three email threads, he wrote
> that:
> 
> - IOCB_DSYNC is always needed when IOCB_SYNC is set, whether or not
>   we're using IOCB_DIRECT.
> 
> - in order to guarantee that a direct write is durable, we /either/ need
>   IOCB_DSYNC + IOCB_DIRECT, /or/ IOCB_DIRECT by itself with a follow-up
>   COMMIT.
> 
> - for some commonly deployed media types, IOCB_DSYNC with IOCB_DIRECT
>   might be slower than IOCB_DIRECT followed up with COMMIT.

Hmm, I have other quotes that stand out to me.. but I'll spare you
(especially since I'm short on time to reply at the moment).

> Therefore, we need to carefully justify why the current patches stick
> with only IOCB_DSYNC + IOCB_DIRECT, or decide it's truly not necessary
> to force all NFSD_IO_DIRECT writes to be IOCB_DSYNC.

The misaligned DIO WRITE is what I'm concerned about given the 3
segments that make up the whole of a given NFS WRITE.

We do have the ability to know a misaligned DIO WRITE is occuring
(nsegs > 1).

I would _really_ appreciate it if we could ensure setting IOCB_DSYNC
for that case.

> Christoph and I (if I may put words in his mouth) both seem to be
> interested in making NFSD_IO_DIRECT useful in contexts other than a very
> specific enterprise-grade server with esoteric NVMe devices and ultra
> high bandwidth networking. After all, one of the deep requirements for
> merging something upstream is that it is likely to be useful to more
> than a very narrow constituency (see recent discussions about merging
> TernFS).

You're fine to extend NFSD_IO_DIRECT to be broadly applicable, I
completely agree with that.  Accomplishing that by removing flags that
ensure data integrity for misaligned DIO WRITE isn't the way forward.

That's by only point.  Its typical to start with more conservtive
protection, especially for something manifesting as a new optional
debug setting like NFSD_IO_DIRECT.

Relaxing the code in a sensible and safe manner is perfectly
acceptable.  I'm just saying it is _not_ in the misaligned DIO WRITE
case (I'll stop repeating that now, promise.. heh).

> If we truly care about making NFSD_IO_DIRECT valuable for small memory
> NFSD systems, then we need to acknowledge that their durable storage is
> likely to be virtual or in some other way bandwidth-compromised, and not
> a directly-attached real NVMe device.

Not always the case, but I agree that is one possibility.

> >> Or, we might decide that, no, NFS WRITE has no data visibility mandate;
> >> applications achieve data visibility explicitly using COMMIT and file
> >> locking, so none of this matters.
> > 
> > We don't have that freedom if we cannot preserve file offset integrity
> > (as would be the case if we removed IOCB_DSYNC when handling all 3
> > segments of a misaligned DIO WRITE).  Removing IOCB_DSYNC would
> > compromise misaligned DIO WRITE as implemented.
> 
> As I understand it, IOCB_DSYNC has nothing to do with whether the three
> segments are directed to the correct file offsets. Serially initiating
> the writes with the same iocb should be sufficient to ensure
> correctness.

IOCB_DSYNC just ensures when they complete they are on-disk.  And any
failure of, or short WRITE, is trapped and immediate cause for early
return.  Whereby ensuring file offset integrity.

> The concern about integrity is that in the multi-segment case, the
> segments won't make it to durable storage at the same time, and an
> intervening NFS READ might see an intermittent file state.

Well that is one concern yes; but I'm also concerned about initiating
any page cache invalidation off the back of the DIO WRITE in the
middle (and its potential to fallback to buffered if that invalidation
fails... any associated buffered IO fallback or ends best have O_DSYNC
set it ensure everything ondisk).

> If the risk of a torn write is an actual problem for an application, it
> should serialize its reads, writes, and flushes itself. I think there
> are already plausible situations in today's non-DIRECT world where
> incomplete writes are visible, so it might be sensible not to worry too
> much about it here.
> 
> Jeff, please do clarify if I've misunderstood your concern.

Yeah, applications really shouldn't expect perfection in the face of
contended buffered READ vs DIO WRITE.. I'm most concerned about us
trying to have misaligned DIO WRITE's IO be handled as if all 3
segments are a unit (_not_ atomic) that when written will still
preserve file offset integrity.  Kills 2 birds of "integrity" (data
and file offset).

> >>> And we already showed that doing so really isn't slow.
> >>
> >> Well we don't have a comparison with "IOCB_DIRECT without IOCB_DSYNC".
> >> That might be faster than what you tested? Plus I think your test was
> >> on esoteric enterprise NVMe devices, not on the significantly more
> >> commonly deployed SSD devices.
> > 
> > IOCB_DIRECT without IOCB_DSYNC isn't an option because we must ensure
> > the data is ondisk.
> 
> You are stating the exact assumption that we are testing right now. It
> is not 100% clear from code and comments in these patches why "we must
> ensure the data is on disk".

So misaligned DIO WRITE in terms of 3 segments works.

Thank you for your attention to this matter! ;)

Mike

  reply	other threads:[~2025-10-28 16:04 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-24 15:21   ` Jeff Layton
2025-10-27  8:02   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-27  8:03   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-27  8:04   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 17:12   ` Mike Snitzer
2025-10-24 17:24     ` Chuck Lever
2025-10-26  0:03   ` kernel test robot
2025-10-26  1:16   ` kernel test robot
2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 15:23     ` Chuck Lever
2025-10-27  8:05   ` Christoph Hellwig
2025-10-27 13:23     ` Chuck Lever
2025-10-27 13:27       ` Christoph Hellwig
2025-10-27 14:31         ` Mike Snitzer
2025-10-27 14:36           ` Christoph Hellwig
2025-10-27 14:58             ` Mike Snitzer
2025-10-27 15:04               ` Chuck Lever
2025-10-27 15:19                 ` Mike Snitzer
2025-10-27 15:05               ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:08   ` Christoph Hellwig
2025-10-27 10:38     ` Jeff Layton
2025-10-27 10:40       ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:09   ` Christoph Hellwig
2025-10-27 13:25     ` Chuck Lever
2025-10-27 13:30       ` Christoph Hellwig
2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
2025-10-24 15:23   ` Jeff Layton
2025-10-24 17:16   ` Mike Snitzer
2025-10-24 17:22     ` Chuck Lever
2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-10-24 15:24   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
2025-10-24 15:26   ` Jeff Layton
2025-10-24 17:20   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
2025-10-24 15:30   ` Jeff Layton
2025-10-24 15:37     ` Chuck Lever
2025-10-24 17:57   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
2025-10-24 15:32   ` Jeff Layton
2025-10-24 18:01   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
2025-10-24 15:34   ` Jeff Layton
2025-10-24 18:13   ` Mike Snitzer
2025-10-24 19:34     ` Chuck Lever
2025-10-24 20:37       ` Mike Snitzer
2025-10-24 21:16         ` Chuck Lever
2025-10-24 23:56           ` Mike Snitzer
2025-10-27  8:15             ` Christoph Hellwig
2025-10-27 10:50               ` Jeff Layton
2025-10-27 10:55                 ` Christoph Hellwig
2025-10-27 13:48                 ` Chuck Lever
2025-10-27 13:49                   ` Christoph Hellwig
2025-10-27 16:18                   ` Mike Snitzer
2025-10-27 16:59                     ` Mike Snitzer
2025-10-29  7:20                     ` Christoph Hellwig
2025-10-27 16:05                 ` Mike Snitzer
2025-10-27 17:57                   ` Chuck Lever
2025-10-28  3:26                     ` Mike Snitzer
2025-10-28 15:37                       ` Chuck Lever
2025-10-28 16:04                         ` Mike Snitzer [this message]
2025-10-28 18:48                           ` Chuck Lever
2025-10-28 23:56                             ` Mike Snitzer
2025-10-29 15:22                               ` Chuck Lever
2025-10-29 16:54                                 ` Mike Snitzer
2025-10-29  7:37                         ` Christoph Hellwig
2025-10-29  7:32                       ` Christoph Hellwig
2025-10-29  7:25                     ` Christoph Hellwig
2025-10-27  8:14         ` Christoph Hellwig
2025-10-27  8:12       ` Christoph Hellwig
2025-10-27 13:27         ` Chuck Lever
2025-10-27 13:30           ` Chuck Lever
2025-10-27 13:31             ` Christoph Hellwig
2025-10-27 14:11         ` Chuck Lever
2025-10-27 14:45           ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQDpjNnj47ISRItg@kernel.org \
    --to=snitzer@kernel.org \
    --cc=cel@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=dai.ngo@oracle.com \
    --cc=hch@infradead.org \
    --cc=hch@lst.de \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=okorniev@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.