All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	NeilBrown <neil@brown.name>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
Date: Tue, 28 Oct 2025 19:56:28 -0400	[thread overview]
Message-ID: <aQFYLKmC2Nr1QUrM@kernel.org> (raw)
In-Reply-To: <b63ed4d4-8bbb-4794-ae07-64a4000e0ea9@kernel.org>

On Tue, Oct 28, 2025 at 02:48:33PM -0400, Chuck Lever wrote:
> On 10/28/25 12:04 PM, Mike Snitzer wrote:
> > I'm also concerned about initiating
> > any page cache invalidation off the back of the DIO WRITE in the
> > middle (and its potential to fallback to buffered if that invalidation
> > fails... any associated buffered IO fallback or ends best have O_DSYNC
> > set it ensure everything ondisk).
>
> My questions:

All good questions, and I appreciate the time you spent analyzing
these aspects.

I'm all for NFSD's IO modes being able to work with all possible
stable_how.  Your analysis helps expand the scope of these IO modes to
really be pushed to their limits (specifically thinking of
NFSD_IO_DIRECT being paired with UNSTABLE and it _not_ using
IOCB_DSYNC, interesting combo but IMO quite a risky default given
we're only at the debugfs interface stage for NFSD_IO_DIRECT).
 
> A. Which filesystem code can return -ENOTBLK?
> 
> - iomap-based DIO (fs/iomap/direct-io.c:715): XFS, ext4, f2fs, gfs2,
>   zonefs, erofs
> - Legacy DIO path (fs/direct-io.c:984): ext2, some ext4 code paths
> - btrfs (has its own DIO implementation that uses -ENOTBLK)
> 
> Code auditing shows that vfs_iocb_iter_write() never returns -ENOTBLK
> because all filesystems that could generate this error internally
> convert it to 0 or another error code before returning from their
> write_iter implementation.
> 
> Therefore NFSD itself does not need to know about or handle page cache
> invalidation failure.

OK, good to hear; but have you reviewed if they are all using D_SYNC
as part of their internal handling of their buffered fallback?  NFSD
setting DSYNC just acknowledges it is required to ensure O_DIRECT
WRITEs are ondisk upon return to client -- something I don't want to
leave to the client to ensure via NFS_UNSTABLE's subsequent COMMIT.

Setting O_DSYNC gives us a baseline where we don't leave some other
subsystem to ensure data gets to disk as quickly as possible (O_DSYNC
offers NFS_DATA_SYNC, but O_DSYNC|SYNC's NFS_FILE_SYNC offers a win by
avoiding COMMIT, even more enticing).

So I need NFSD_IO_DIRECT to be able to require IOCB_DSYNC and possibly
IOCB_DSYNC|IOCB_SYNC.  And I'm most comfortable with the misaligned
DIO WRITE support using O_DSYNC for all 3 segments; so that WRITE's
data is ondisk before returning from the WRITE (NFS_DATA_SYNC), and
bonus if we can avoid extra COMMIT (NFS_FILE_SYNC).

> B. Even if NFSD had to recover from a failed page cache invalidation,
>    does a fallback write need to set IOCB_DSYNC (not considering the NFS
>    protocol requirements) ?
> 
> When invalidation fails, some pages remain in the cache (dirty or with
> private data). The buffered fallback writes to those same pages,
> updating them with the correct data. There's no torn state or corruption
> hazard.
> 
> The fallback restarts the entire write from scratch, as buffered. Any
> partial progress (like a buffered first segment) gets rewritten with
> the same data. No orphaned blocks or inconsistent state.
> 
> If another process does a DIO read of this region, it will call
> kiocb_write_and_wait() first (see mm/filemap.c:2912), which flushes any
> dirty pages before reading. So they'll see correct data even if it's
> still dirty in cache (Jeff's concern).

Yes, but the point is there is extra work that involves the page
cache by other layers.  And O_DIRECT is intended to work expecting the
IO has been flushed to disk.  O_DIRECT is aiming to reduce page cache
usage, to keep memory use low.  So ensuring the page cache invalidated
to disk by the subsequent write as a side-efffect of O_DSYNC helps
(more on this below [0]).

> C. When setting up a three-segment write, is IOCB_DSYNC needed on the
>    unaligned ends (not considering the NFS protocol requirements) ?
> 
> Before writing the DIO middle segment, the VFS must first invalidate the
> page cache, so the first (buffered) segment is flushed to durability
> anyway. The use of IOCB_DSYNC for the first segment is superfluous.

No, the first segment reflects the misaligned head of the WRITE. The
DIO-aligned middle segment's DIO will just invalidate pages aligned on
the middle segment's alignment (which is page aligned).  So the DIO
from the middle segment won't have any side-effect on the first
segment's page.  First page needs to be written out with IOCB_DSYNC.

> After writing the DIO middle segment, the unaligned end remains only in
> the page cache if IOCB_DSYNC is not used. But a subsequent DIO read will
> flush that to durability (via kiocb_write_and_wait) before satisfying
> the read. So, IOCB_DSYNC is not needed there to meet putative file data
> visibility mandates.

The same applies to both the misaligned first (head) and third (tail)
segments, they both need O_DSYNC to ensure their contents are ondisk.

[0]: Again, my goal for NFSD_IO_DIRECT is to operate in terms of
O_DIRECT|O_DSYNC ondisk.. and not lean on the page cache more than
needed. This makes the VM subsystem more scalable as a side-effect of
it having more clean pages that are quickly dropped and/or reused if
something needs memory.

> >> As I understand it, IOCB_DSYNC has nothing to do with whether the three
> >> segments are directed to the correct file offsets. Serially initiating
> >> the writes with the same iocb should be sufficient to ensure
> >> correctness.
> > 
> > IOCB_DSYNC just ensures when they complete they are on-disk.  And any
> > failure of, or short WRITE, is trapped and immediate cause for early
> > return.  Whereby ensuring file offset integrity.
> 
> AFAICT, generic_perform_write() updates ki_pos synchronously whenever
> it returns successfully, regardless of the IOCB_DSYNC or IOCB_DIRECT
> flag settings. If the vfs_iocb_iter_write() call fails or returns fewer
> bytes than expected, the loop terminates immediately and writes to
> subsequent segments are not initiated. I don't see a data integrity
> issue here.

For ki_pos we're saying the same thing: yes ki_pos is updated (and
it'll reflect a short write if one happens, etc). IOCB_DSYNC doesn't
influence if ki_pos is advanced. But if IOCB_DSYNC isn't set then
ki_pos is advanced beyond what was written _to disk_. That obviously
doesn't matter if NFS_UNSTABLE, thanks to its COMMIT, but it does
increase latency (though it is worth benchmarking with various
workloads!).

Anyway, I've been working to ensure NFSD_IO_DIRECT acts as if at least
NFS_DATA_SYNC, but preferably NFS_FILE_SYNC, set.  I should have more  
clearly stated that.  Bypassing all caches as much possible allows for
enabling scalable cache coherence, e.g. intelligent applications that
don't rely on file locking, think multiple writers writing to their
own extent of a large file.

Having NFSD_IO_DIRECT be handled as UNSTABLE isn't adequate for my
needs (immediate ondisk requirements, lowest memory and CPU usage as
possible).  Would you be OK with adding 2 new debugfs knobs?:

/sys/kernel/debug/nfsd/io_cache_read_stable_how
/sys/kernel/debug/nfsd/io_cache_write_stable_how

- Each defaults to using the stable_how that the client requested.

- Each will serve as a floor, and only override default if they are
  greater/stricter than the client requested stable_how. (so if
  'io_cache_write_stable_how' set to NFS_FILE_SYNC it'd override
  client specified UNSTABLE).

- Each can override the stable_how used so each IO mode behaves
  accordingly (e.g. I can set NFSD_IO_DIRECT and NFS_FILE_SYNC to get
  the bahviour I'd like).

> Thus I'm still unconvinced that IOCB_DSYNC is required if the client has
> requested an UNSTABLE write. I'm open to reviewing actual evidence of a
> failure mode where IOCB_DSYNC might prevent data corruption, but so far
> I don't see how it can happen.

Yeah, I see your point now.  If io_cache_{read,write}_stable_how
debugfs controls are acceptable to you it'll give us maximum
flexibility for controlling how NFSD behaves in each NFSD_IO mode.

Thanks.
Mike

  reply	other threads:[~2025-10-28 23:56 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-24 15:21   ` Jeff Layton
2025-10-27  8:02   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-27  8:03   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-27  8:04   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 17:12   ` Mike Snitzer
2025-10-24 17:24     ` Chuck Lever
2025-10-26  0:03   ` kernel test robot
2025-10-26  1:16   ` kernel test robot
2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 15:23     ` Chuck Lever
2025-10-27  8:05   ` Christoph Hellwig
2025-10-27 13:23     ` Chuck Lever
2025-10-27 13:27       ` Christoph Hellwig
2025-10-27 14:31         ` Mike Snitzer
2025-10-27 14:36           ` Christoph Hellwig
2025-10-27 14:58             ` Mike Snitzer
2025-10-27 15:04               ` Chuck Lever
2025-10-27 15:19                 ` Mike Snitzer
2025-10-27 15:05               ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:08   ` Christoph Hellwig
2025-10-27 10:38     ` Jeff Layton
2025-10-27 10:40       ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:09   ` Christoph Hellwig
2025-10-27 13:25     ` Chuck Lever
2025-10-27 13:30       ` Christoph Hellwig
2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
2025-10-24 15:23   ` Jeff Layton
2025-10-24 17:16   ` Mike Snitzer
2025-10-24 17:22     ` Chuck Lever
2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-10-24 15:24   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
2025-10-24 15:26   ` Jeff Layton
2025-10-24 17:20   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
2025-10-24 15:30   ` Jeff Layton
2025-10-24 15:37     ` Chuck Lever
2025-10-24 17:57   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
2025-10-24 15:32   ` Jeff Layton
2025-10-24 18:01   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
2025-10-24 15:34   ` Jeff Layton
2025-10-24 18:13   ` Mike Snitzer
2025-10-24 19:34     ` Chuck Lever
2025-10-24 20:37       ` Mike Snitzer
2025-10-24 21:16         ` Chuck Lever
2025-10-24 23:56           ` Mike Snitzer
2025-10-27  8:15             ` Christoph Hellwig
2025-10-27 10:50               ` Jeff Layton
2025-10-27 10:55                 ` Christoph Hellwig
2025-10-27 13:48                 ` Chuck Lever
2025-10-27 13:49                   ` Christoph Hellwig
2025-10-27 16:18                   ` Mike Snitzer
2025-10-27 16:59                     ` Mike Snitzer
2025-10-29  7:20                     ` Christoph Hellwig
2025-10-27 16:05                 ` Mike Snitzer
2025-10-27 17:57                   ` Chuck Lever
2025-10-28  3:26                     ` Mike Snitzer
2025-10-28 15:37                       ` Chuck Lever
2025-10-28 16:04                         ` Mike Snitzer
2025-10-28 18:48                           ` Chuck Lever
2025-10-28 23:56                             ` Mike Snitzer [this message]
2025-10-29 15:22                               ` Chuck Lever
2025-10-29 16:54                                 ` Mike Snitzer
2025-10-29  7:37                         ` Christoph Hellwig
2025-10-29  7:32                       ` Christoph Hellwig
2025-10-29  7:25                     ` Christoph Hellwig
2025-10-27  8:14         ` Christoph Hellwig
2025-10-27  8:12       ` Christoph Hellwig
2025-10-27 13:27         ` Chuck Lever
2025-10-27 13:30           ` Chuck Lever
2025-10-27 13:31             ` Christoph Hellwig
2025-10-27 14:11         ` Chuck Lever
2025-10-27 14:45           ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQFYLKmC2Nr1QUrM@kernel.org \
    --to=snitzer@kernel.org \
    --cc=cel@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=dai.ngo@oracle.com \
    --cc=hch@infradead.org \
    --cc=hch@lst.de \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=okorniev@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.