linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@infradead.org>
To: Mike Snitzer <snitzer@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	Chuck Lever <cel@kernel.org>, NeilBrown <neil@brown.name>,
	Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Tue, 21 Oct 2025 22:14:42 -0700	[thread overview]
Message-ID: <aPhoQtHH8iscmmKJ@infradead.org> (raw)
In-Reply-To: <aPZi2DXtdELwjTu2@kernel.org>

On Mon, Oct 20, 2025 at 12:27:04PM -0400, Mike Snitzer wrote:
> > Can you define synchronous better here?  The term is unfortunately
> > overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
> > style I/O.  As of now I don't understand which one you mean, especially
> > with the DONTCACHE reference thrown in, but I guess I'll figure it out
> > reading the patch.
> 
> It clearly talks about synchronous IO.  DONTCACHE just happens to be
> the buffered IO that'll be used if supported.
> 
> I don't see anything that needs changing here.

Again, what synchronous?  O_(D)SYNC, or !async?

> 
> > > If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> > > invalidate the page cache on behalf of the DIO WRITE, then
> > > nfsd_issue_write_dio() will fall back to using buffered IO.
> > 
> > Did you see -ENOTBLK leaking out of the file systems?  Because at
> > least for iomap it is supposed to be an indication that the
> > file system ->write_iter handler needs to retry using buffered
> > I/O and never leak to the caller.
> 
> fs/iomap/direct-io.c:__iomap_dio_rw() will return -ENOTBLK on its own.

Yes, and the callers in the file system methods are supposed to handle
that and fall back to buffered I/O.  Take a look at xfs_file_write_iter.

If this leaks out of ->write_iter we have a problem that needs fixing
in the affected file system.

> 
> > > These changes served as the original starting point for the NFS
> > > client's misaligned O_DIRECT support that landed with
> > > commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> > > READ and WRITE"). But NFSD's support is simpler because it currently
> > > doesn't use AIO completion.
> > 
> > I don't understand this paragraph.  What does starting point mean
> > here?  How does it matter for the patch description?
> 
> This patch's NFSD changes were starting point for NFS commit
> c817248fc831

But that commit is already in?  This sentence really confuses me.

> Would prefer that cleanup, if done, to happen with an incremental
> follow-up patch.

Starting out with well structured code is generally a good idea :(

> 
> > > +static bool
> > > +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> > > +                        unsigned int len_mask)
> > 
> > Wouldn't it make sense to track the alignment when building the bio_vec
> > array instead of doing another walk here touching all cache lines?
> 
> Yes, that is the conventional wisdom that justified Keith removing
> iov_iter_aligned.  But for NFSD's WRITE payload the bio_vec is built
> outside of the fs/nfsd/vfs.c code.  Could be there is a natural way to
> clean this up (to make the sunrpc layer to conditionally care about
> alignment) but I didn't confront that yet.

Well, for the block code it's also build outside the layer consuming it.
But Keith showed that you can easily communicate that information and
avoid extra loops touching the cache lines.

> 
> Room for follow-on improvement to be sure.
> 
> > > +	if (unlikely(dio_blocksize > PAGE_SIZE))
> > > +		return false;
> > 
> > Why does this matter?  Can you add a comment explaining it?
> 
> I did/do have concerns that a bug in the storage driver could expose
> dio_offset_align that is far too large and so bounded dio_blocksize to
> a conservative size.  What is a better constant to check?

I don't think there is a good upper limit, we not do supper LBA sizes
larger than PAGE_SIZE, although I don't think there's any shipping
devices that do that.  But the most important thing is to always add
a comment with the rationale for such non-obvious checks so that someone
who has to lift it later can understand why it is done.

> > > +	 * Also update @stable_how to avoid need for COMMIT.
> > > +	 */
> > > +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> > 
> > What do you mean with completed before returning?  I guess you
> > mean writeback actually happening, right?  Why do you need that,
> > why do you also force it for the direct I/O?
> 
> The IO needs to have reached stable storage.

Please spell that out.  Because that's different from how completed
is generally used in file systems and storage protocols.

> > Also IOCB_SYNC is wrong here, as the only thing it does over
> > IOCB_DSYNC is also forcing back of metadata not needed to find
> > data (aka timestamps), which I can't see any need for here.
> 
> Well, we're eliding any followup SYNC from the NFS client by setting
> stable_how to NFS_FILE_SYNC, so I made sure to use SYNC:

Why do you care about timestamps reaching stable storage?  Which is the
only practical extra thing IOCB_SYNC will give you.  If there is a good
reason for that, add a comment because that's a bit unexpected (and
in general IOCB_SYNC / O_SYNC is just a sign that someone does not know
about SYNC vs DSYNC semantics).

> > The VFS certainly does not, and if it leaks out of a specific file
> > system we need to fix that.
> 
> As I said above, fs/iomap/direct-io.c:__iomap_dio_rw() does.

As stated above that's an internal helper and the code is supposed to
be handled by the file system and never leak out.

> Regardless, the most likely outcome from issing what should be aligned
> DIO only to get -EINVAL is exactly what is addressed with this
> _unlikely_ error handling: misaligned IO..

No.  There is no reason why this is any more likely than any of the
other -EINVAL cases.


  reply	other threads:[~2025-10-22  5:14 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-18  0:54 [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-18  0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-20  7:02   ` Christoph Hellwig
2025-10-18  0:54 ` [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-20  7:19   ` Christoph Hellwig
2025-10-20 13:56     ` Chuck Lever
2025-10-20 14:05       ` Christoph Hellwig
2025-10-20 16:27     ` Mike Snitzer
2025-10-22  5:14       ` Christoph Hellwig [this message]
2025-10-22 14:37         ` Chuck Lever
2025-10-23  5:46           ` Christoph Hellwig
2025-10-21 11:24     ` Jeff Layton
2025-10-22  5:16       ` Christoph Hellwig
2025-10-22 10:15         ` Jeff Layton
2025-10-22 11:17           ` Christoph Hellwig
2025-10-22 11:30             ` Jeff Layton
2025-10-22 13:31             ` Chuck Lever
2025-10-23  5:27               ` Christoph Hellwig
2025-10-22 17:59     ` Chuck Lever
2025-10-23  5:52       ` Christoph Hellwig
2025-10-18  0:54 ` [PATCH v4 3/3] svcrdma: Mark Read chunks Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPhoQtHH8iscmmKJ@infradead.org \
    --to=hch@infradead.org \
    --cc=cel@kernel.org \
    --cc=dai.ngo@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=okorniev@redhat.com \
    --cc=snitzer@kernel.org \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).