From: Mike Snitzer <snitzer@kernel.org>
To: NeilBrown <neilb@ownmail.net>
Cc: Chuck Lever <cel@kernel.org>,
Christoph Hellwig <hch@infradead.org>,
Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
linux-nfs@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Fri, 7 Nov 2025 21:01:53 -0500 [thread overview]
Message-ID: <aQ6kkd74pj2aUd8b@kernel.org> (raw)
In-Reply-To: <176255894778.634289.2265909350991291087@noble.neil.brown.name>
On Sat, Nov 08, 2025 at 10:42:27AM +1100, NeilBrown wrote:
> On Sat, 08 Nov 2025, Mike Snitzer wrote:
> > On Sat, Nov 08, 2025 at 08:58:56AM +1100, NeilBrown wrote:
> > > On Sat, 08 Nov 2025, Mike Snitzer wrote:
> > > > On Fri, Nov 07, 2025 at 03:08:11PM -0500, Chuck Lever wrote:
> > > > > On 11/7/25 3:05 PM, Mike Snitzer wrote:
> > > > > >> Agreed. The cover letter noted that this is still controversial.
> > > > > > Only see this, must be missing what you're referring to:
> > > > > >
> > > > > > Changes since v9:
> > > > > > * Unaligned segments no longer use IOCB_DONTCACHE
> > > > >
> > > > > From the v11 cover letter:
> > > > >
> > > > > > One controversy remains: Whether to set DONTCACHE for the unaligned
> > > > > > segments.
> > > >
> > > > Ha, blind as a bat...
> > > >
> > > > hopefully the rest of my reply helps dispel the controversy
> > > >
> > >
> > > Unfortunately I don't think it does. I don't think it even addresses
> > > it.
> > >
> > > What Christoph said was:
> > >
> > > I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> > > nothing is aligned, but not for the non-aligned parts as that is
> > > extremely counter-intuitive.
> > >
> > > You gave a lengthy (and probably valid) description on why "not for the
> > > non-aligned parts" but don't seem to address the "why to set
> > > IOCB_DONTCACHE when nothing is aligned" bit.
> >
> > Because if the entire IO is so poorly formed that it cannot be issued
> > with DIO then at least we can provide the spirit of what was requested
> > from the admin having cared to enable NFSD_IO_DIRECT.
>
> "so poorly formed"... How can IO be poorly formed in a way that is
> worse than not being aligned?
I was talking about the no_dio case when iov_iter_bvec_offset() isn't
aligned (case 2 below). The entire WRITE must use buffered IO even
though NFSD_IO_DIRECT configured.
> > Those IOs could be quite large.. full WRITE payload. I have a
> > reproducer if you'd like it!
>
> Can they? How does that work?
> From looking at the code I can only see that "Those IOs" are too small
> to have any whole pages in them.
Huh? You're clearly confused by the 2 cases of NFSD_IO_DIRECT's
unaligned WRITE handling (as evidenced by you solidly mixing them).
1: WRITE is split, has DIO-aligned middle and prefix and/or suffix.
prefix/suffix are the subpage IOs of a misaligned WRITE that _does_
have a DIO-aligned middle, I covered that in detail in this email:
https://lore.kernel.org/linux-nfs/aQ5Q99Kvw0ZE09Th@kernel.org/
it left you wanting, thinking I missed Christoph's point (I didn't).
2: WRITE cannot be split such that middle is DIO-aligned.
The case you wanted me to elaborate on, I did so here:
https://lore.kernel.org/linux-nfs/aQ5xkjIzf6uU_zLa@kernel.org/
when entire WRITE is misaligned it makes all the sense in the world to
use DONTCACHE. Chuck, Christoph and I all agree on this.
But just because it makes sense for case 2's (potentially large) WRITE
to not use cached buffered IO if NFSD_IO_DIRECT, it doesn't logically
follow that case 1's two individual pages for prefix/suffix must use
DONTCACHE buffered IO too.
Now do you see the so-called "controversy"?
Christoph expects case 1 to use DONTCACHE because it best reflects
O_DIRECT semantics (avoid using caching).
Here is another way to look at it:
Q: Case 2 uses DONTCACHE, so case 1 should too right?
A: NO. There is legit benefit to having case 1 use cached buffered IO
when issuing its 2 subpage IOs; and that benefit doesn't cause harm
because order-0 page management is not causing the MM problems that
NFSD_IO_DIRECT sets out to avoid (whereas higher order cached buffered
IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid.
Otherwise MM spins and spins trying to find adequate free pages,
cannot so does dirty writeback and reclaim which causes kswapd and
kcompactd to burn cpu, etc).
NFSD_IO_DIRECT and DONTCACHE both want to avoid the pathological worst
cases of the MM subsystem when its coping with large IOs. But
NFSD_IO_DIRECT is doing so in terms of O_DIRECT whereas DONTCACHE is
approximating O_DIRECT (uses page cache with drop-behind semantics,
without concern for DIO alignment, etc). Both are effective in their
own way, but NFSD_IO_DIRECT only uniformly benefits from using
DONTCACHE buffered IO for case 2.
I'm advocating we make an informed decision for NFSD_IO_DIRECT to use
cached buffered IO for case 1's simple usage of order-0 allocations.
Let's please not get hung up of intent of O_DIRECT because
NFSD_IO_DIRECT achieves that intent very well -- and by using cached
buffered IO it can avoid pathological performance problems (case 1's
handling of streaming misaligned WRITEs that _must_ do RMW).
> Maybe we need a clear description of the sort of IOs that are big enough
> that they impose a performance cost on the page-cache, but that somehow
> cannot be usefully split into a prefix/middle/suffix where prefix and
> suffix are less than a page each.
So defining case 2 again: entire WRITE payload is unaligned because
each aligned logical_block_size doesn't start on a dma_alignment
boundary in memory. (Also the definition of "the entire IO is so
poorly formed that it cannot be issued with DIO".)
Mike
next prev parent reply other threads:[~2025-11-08 2:01 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:39 ` Christoph Hellwig
2025-11-07 15:40 ` Chuck Lever
2025-11-07 20:05 ` Mike Snitzer
2025-11-07 20:08 ` Chuck Lever
2025-11-07 20:10 ` Mike Snitzer
2025-11-07 21:58 ` NeilBrown
2025-11-07 22:24 ` Mike Snitzer
2025-11-07 23:42 ` NeilBrown
2025-11-08 2:01 ` Mike Snitzer [this message]
2025-11-10 16:41 ` Chuck Lever
2025-11-10 17:57 ` Mike Snitzer
2025-11-11 8:51 ` Christoph Hellwig
2025-11-11 14:20 ` Chuck Lever
2025-11-11 14:21 ` Christoph Hellwig
2025-11-12 0:06 ` Mike Snitzer
2025-11-12 15:02 ` Christoph Hellwig
2025-11-12 23:14 ` Mike Snitzer
2025-11-13 8:13 ` Christoph Hellwig
2025-11-13 21:45 ` Mike Snitzer
2025-11-07 20:28 ` Chuck Lever
2025-11-07 22:16 ` Mike Snitzer
2025-11-10 9:12 ` Christoph Hellwig
2025-11-10 15:42 ` Mike Snitzer
2025-11-11 8:44 ` Christoph Hellwig
2025-11-10 9:17 ` Christoph Hellwig
2025-11-10 15:43 ` Mike Snitzer
2025-11-07 17:18 ` Mike Snitzer
2025-11-07 22:13 ` NeilBrown
2025-11-07 15:34 ` [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aQ6kkd74pj2aUd8b@kernel.org \
--to=snitzer@kernel.org \
--cc=cel@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=dai.ngo@oracle.com \
--cc=hch@infradead.org \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@ownmail.net \
--cc=okorniev@redhat.com \
--cc=tom@talpey.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).