From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
avi@scylladb.com, andres@anarazel.de
Subject: Re: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
Date: Wed, 13 Jan 2021 09:06:41 +1100
Message-ID: <20210112220641.GT331610@dread.disaster.area>
In-Reply-To: <20210112170133.GD1137163@bfoster>
On Tue, Jan 12, 2021 at 12:01:33PM -0500, Brian Foster wrote:
> On Tue, Jan 12, 2021 at 11:42:57AM +0100, Christoph Hellwig wrote:
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index bba33be17eff..f5c75404b8a5 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
> > > drained_dio = true;
> > > goto restart;
> > > }
> > > -
> > > +
> >
> > Spurious unrelated whitespace change.
> >
> > > struct iomap_dio_rw_args args = {
> > > .iocb = iocb,
> > > .iter = from,
> > > .ops = &xfs_direct_write_iomap_ops,
> > > .dops = &xfs_dio_write_ops,
> > > .wait_for_completion = is_sync_kiocb(iocb),
> > > - .nonblocking = (iocb->ki_flags & IOCB_NOWAIT),
> > > + .nonblocking = true,
> >
> > I think this is in many ways wrong. As far as I can tell you want this
> > so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> > we should not trigger any of the other checks, so we'd really need
> > another flag instead of reusing this one.
> >
>
> It's really the br_state != XFS_EXT_NORM check that we want for the
> unaligned case, isn't it?
We can only submit unaligned DIO with a shared IOLOCK to a written
range, which means we need to abort the IO if we hit a COW range
(imap_needs_cow()), a hole (imap_needs_alloc()), a range that spans
multiple extents (imap_spans_range()) or, finally, an unwritten
extent (the new check I added).
IOMAP_NOWAIT aborts on all these cases and returns EAGAIN.
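As an illustration only, the four abort conditions above can be modelled in userspace C. This is a sketch, not the XFS code: struct ext_map, its fields and unaligned_dio_check() are hypothetical stand-ins for the real extent record and for the imap_needs_cow(), imap_needs_alloc() and imap_spans_range() checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified extent mapping; not the real XFS extent record. */
enum ext_state { EXT_NORM, EXT_UNWRITTEN };

struct ext_map {
	uint64_t	start;		/* first byte covered by the extent */
	uint64_t	len;		/* number of bytes covered */
	bool		is_hole;	/* no blocks allocated: needs allocation */
	bool		is_shared;	/* shared blocks: a write needs COW */
	enum ext_state	state;		/* written or unwritten */
};

/*
 * Return 0 if a sub-block write to [pos, pos + count) may be issued under
 * the shared lock, or -EAGAIN if it has to be retried exclusively.
 */
static int
unaligned_dio_check(const struct ext_map *map, uint64_t pos, uint64_t count)
{
	assert(count > 0);

	if (map->is_shared)		/* COW range */
		return -EAGAIN;
	if (map->is_hole)		/* hole: needs allocation */
		return -EAGAIN;
	if (pos < map->start || pos + count > map->start + map->len)
		return -EAGAIN;		/* IO is not inside this one extent */
	if (map->state != EXT_NORM)	/* unwritten: needs conversion */
		return -EAGAIN;
	return 0;
}
```

In this model, only a write that lands entirely inside a single written, unshared extent returns 0; everything else is pushed back to the caller.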
> > imap_spans_range is a bit pessimistic for avoiding the exclusive lock,
No, it's absolutely required.
If the sub-block aligned dio spans multiple extents, we don't know
what locking is required for that next extent until iomap_apply()
loops and calls us again for that range. While the first range might
be written and OK to issue, the next extent range could
require allocation, COW or unwritten extent conversion and so would
require exclusive IO locking. And so we end up with partial IO
submission, which causes all sorts of problems...
IOWs, if the unaligned dio cannot be mapped to a single written
extent, we can't do it under shared locking conditions - it must be
done under exclusive locking to maintain the "no partial submission"
rules we have for DIO.
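A rough userspace sketch of that fallback, under stated assumptions: struct fake_file, fake_submit() and unaligned_dio_write() are hypothetical names invented for this example, and the model ignores the case where the caller itself asked for non-blocking IO. The point it tries to show is that the shared attempt either submits the whole IO or submits nothing before the exclusive retry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum iolock_mode { IOLOCK_SHARED, IOLOCK_EXCL };

/* Hypothetical file state: does the IO map to a single written extent? */
struct fake_file {
	bool	single_written_extent;
	int	submissions;		/* number of IOs actually issued */
};

/* Hypothetical submit hook: 0 on success, -EAGAIN if it refuses to proceed. */
typedef int (*submit_fn)(enum iolock_mode mode, bool nonblocking, void *ctx);

static int
fake_submit(enum iolock_mode mode, bool nonblocking, void *ctx)
{
	struct fake_file *f = ctx;

	(void)mode;
	/* Bail out before issuing anything, so there is no partial submission. */
	if (nonblocking && !f->single_written_extent)
		return -EAGAIN;
	f->submissions++;
	return 0;
}

/* Try the shared lock first; on -EAGAIN redo the whole IO exclusively. */
static int
unaligned_dio_write(submit_fn submit, void *ctx, enum iolock_mode *used)
{
	int ret;

	assert(submit && used);

	*used = IOLOCK_SHARED;
	ret = submit(IOLOCK_SHARED, true, ctx);
	if (ret != -EAGAIN)
		return ret;

	*used = IOLOCK_EXCL;
	return submit(IOLOCK_EXCL, false, ctx);
}
```

With single_written_extent set, the write completes under IOLOCK_SHARED; otherwise it is issued exactly once, under IOLOCK_EXCL.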
> > but I guess we could live that if it is clearly documented as helping
> > with the implementation, but we really should not automatically trigger
> > all the other effects of nowait I/O.
>
> Regardless, I agree on this point.
The only thing that IOMAP_NOWAIT does that might be questionable is
the xfs_ilock_nowait() call on the ILOCK. We want it to abort shared
IO if we don't have the extents read in (Christoph's patch made
this trigger exclusive IO, too), and so of all the things that
IOMAP_NOWAIT triggers, the -only thing- we can raise a question
about is the trylock.
And, quite frankly, if something is modifying the inode metadata
while we are trying to sub-block DIO, I want the sub-block DIO to
fall back to exclusive locking just to be safe. It may not be
necessary, but right now I'd prefer to err on the side of caution
and be conservative about when this optimisation triggers. If we get
it wrong, we corrupt data....
> I don't have a strong opinion in general on this approach vs. the
> other, but it does seem odd to me to overload the broader nowait
> semantics with the unaligned I/O checks. I see that it works for
> the primary case we care about, but this also means things like
> the _has_page() check now trigger exclusivity for the unaligned
> case where that doesn't seem to be necessary.
Actually, it's another case of being safe rather than sorry. If the
sub-block DIO is racing with mmap or write() dirtying the page that
spans the DIO range, we end up issuing concurrent IOs to the same
LBA range, something that results in undefined behaviour and is
something we must absolutely not do.
That is:
	DIO (1024, 512)
	  submit_bio (1024, 512)
	  .....
					mmap
					  (0, 4096)
					  touch byte 0
					  page dirty
	DIO (2048, 512)
	  filemap_write_and_wait_range(2048, 512)
	    submit_bio(0, 4096)
	    .....
and now we have overlapping concurrent IO in flight even though
userspace has not done any overlapping modifications at all.
Overlapping IO should never be issued by the filesystem as the
result is undefined. Yes, the application should not be mixing
mmap+DIO, but the filesystem in this case is doing something even
worse and something we tell userspace developers that *they should
never do*. We can trivially avoid this corruption case by falling
back to exclusive locking for subblock dio if writeback and/or page
cache invalidation may be required.
IOWs, IOMAP_NOWAIT gives us exactly the behaviour we need here for
serialising concurrent sub-block dio against page cache based IO...
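To make the overlap in the example above concrete, here is a small userspace sketch of the arithmetic. It assumes a 4096 byte page, matching the (0, 4096) range above; struct byte_range, page_span() and ranges_overlap() are hypothetical helpers written for this illustration, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ	4096ULL		/* assumed page size for this example */

struct byte_range {
	uint64_t	start;
	uint64_t	len;
};

/* Round a byte range out to the page(s) that writeback would have to write. */
static struct byte_range
page_span(struct byte_range r)
{
	uint64_t first, end;

	assert(r.len > 0);
	first = r.start & ~(PAGE_SZ - 1);
	end = (r.start + r.len + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
	return (struct byte_range){ .start = first, .len = end - first };
}

/* True if the half-open ranges [start, start + len) intersect. */
static bool
ranges_overlap(struct byte_range a, struct byte_range b)
{
	return a.start < b.start + b.len && b.start < a.start + a.len;
}
```

The two DIOs at (1024, 512) and (2048, 512) do not overlap each other, but page_span() of the second one is (0, 4096), which does overlap the first DIO that is still in flight.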
> I do like the
> previous cleanups so I suspect if we worked this into a new
> 'subblock_io' flag that indicates to the lower layer whether the
> filesystem can allow zeroing, that might clean much of this up.
Allow zeroing where, exactly? e.g. some filesystems do zeroing in
their allocation routines during mapping. IOWs, this strikes me as
encoding specific filesystem implementation requirements into the
generic API as opposed to using generic functionality to implement
specific FS behavioural requirements.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com