From: Dave Chinner <david@fromorbit.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Jens Axboe <axboe@kernel.dk>, Chris Mason <clm@fb.com>,
	Matthew Wilcox <willy@infradead.org>,
	Daniel Gomez <da.gomez@samsung.com>,
	linux-mm <linux-mm@kvack.org>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>,
	Josef Bacik <josef@toxicpanda.com>, Jan Kara <jack@suse.cz>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Thu, 29 Feb 2024 11:25:33 +1100
Message-ID: <Zd/O/S3rdvZ8OxZJ@dread.disaster.area>
In-Reply-To: <CAOQ4uxi=fdjXq7q0_+0mDovmBd6Afb=xteFBSnE-rUmQMJYgRQ@mail.gmail.com>

On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
> <lsf-pc@lists.linux-foundation.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > > it's _buffered_ writes that we care about here.
> > > >
> > > > We could do concurrent buffered writes in XFS - we would just use
> > > > the same locking strategy as direct IO and fall back on folio locks
> > > > for copy-in exclusion like ext4 does.
> > >
> > > ext4 code doesn't do that. It takes the inode lock in exclusive mode,
> > > just like everyone else.
> >
> > Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> > constrained than XFS. See ext4_dio_write_checks().
> >
> > > > The real question is how much of userspace will that break, because
> > > > of implicit assumptions that the kernel has always serialised
> > > > buffered writes?
> > >
> > > What would break?
> >
> > Good question. If you don't know the answer, then you've got the
> > same problem as I have. i.e. we don't know if concurrent
> > applications that use buffered IO extensively (eg. postgres) assume
> > data coherency because of the implicit serialisation occurring
> > during buffered IO writes?
> >
> > > > > If we do a short write because of a page fault (despite previously
> > > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > > torn writes and atomicity breakage; we could at least try a trylock on
> > > > > the inode lock, I didn't do that here.
> > > >
> > > > As soon as we go for concurrent writes, we give up on any concept of
> > > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > > doesn't matter at all.
> > >
> > > We've already given up buffered write vs. read atomicity, have for a
> > > long time - buffered read path takes no locks.
> >
> > We still have explicit buffered read() vs buffered write() atomicity
> > in XFS via buffered reads taking the inode lock shared (see
> > xfs_file_buffered_read()) because that's what POSIX says we should
> > have.
> >
> > Essentially, we need to explicitly give POSIX the big finger and
> > state that there are no atomicity guarantees given for write() calls
> > of any size, nor are there any guarantees for data coherency for
> > any overlapping concurrent buffered IO operations.
> >
> 
> I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> like the other filesystems ever since Jan added the invalidate_lock, and I
> believe that the Meta kernel did that way before then.
> 
> > Those are things we haven't completely given up yet w.r.t. buffered
> > IO, and enabling concurrent buffered writes will expose that to users.
> > So we need to have explicit policies for this and document them
> > clearly in all the places that application developers might look
> > for behavioural hints.
> 
> That's doable - I can try to do that.
> What is your take regarding opt-in/opt-out of legacy behavior?

Screw the legacy code, don't even make it an option. No-one should
be relying on large buffered writes being atomic anymore, and with
high order folios in the page cache most small buffered writes are
going to be atomic w.r.t. both reads and writes anyway.

> At the time, I proposed the POSIX_FADV_TORN_RW API [1]
> to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> option would make more sense for consistent and clear semantics across
> the fs - it is easier if all buffered IO to an inode behaves the same way.

No mount options, just change the behaviour. Applications already
have to avoid concurrent overlapping buffered reads and writes if
they care about data integrity and coherency, so making buffered
writes concurrent doesn't change anything.
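
A minimal sketch of what that application-side exclusion could look
like, assuming every cooperating process agrees to take advisory POSIX
byte-range locks around overlapping accesses. The helper names
(range_lock, locked_pwrite, locked_pread) are hypothetical and are not
from any kernel or library API. These locks are per-process, so threads
inside one process would still need their own mutex.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def range_lock(fd, offset, length, exclusive):
    """Hold an advisory POSIX lock on bytes [offset, offset + length)."""
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    # Blocks until no other process holds a conflicting lock on the range.
    fcntl.lockf(fd, kind, length, offset, os.SEEK_SET)
    try:
        yield
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)


def locked_pwrite(fd, data, offset):
    """Buffered write that excludes other cooperating readers and writers."""
    with range_lock(fd, offset, len(data), exclusive=True):
        written = 0
        # pwrite() may be short; loop while still holding the lock so the
        # whole buffer lands before any cooperating reader can look at it.
        while written < len(data):
            written += os.pwrite(fd, data[written:], offset + written)
    return written


def locked_pread(fd, length, offset):
    """Buffered read that excludes cooperating writers on the same range."""
    with range_lock(fd, offset, length, exclusive=False):
        return os.pread(fd, length, offset)
```

The file descriptor has to be open for both reading and writing for the
shared and exclusive lock types to be accepted. Because the locks are
advisory, a process that skips them is not excluded, which is the same
position an application is in when the kernel does not serialise
overlapping buffered IO.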

Dave.
-- 
Dave Chinner
david@fromorbit.com
