From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: John Garry <john.g.garry@oracle.com>,
linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
martin.petersen@oracle.com, himanshu.madhani@oracle.com
Subject: Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
Date: Mon, 9 Oct 2023 14:05:31 -0700
Message-ID: <20231009210531.GB214073@frogsfrogsfrogs>
In-Reply-To: <ZSRk9Z6/i2E+YV9A@dread.disaster.area>

On Tue, Oct 10, 2023 at 07:39:17AM +1100, Dave Chinner wrote:
> On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote:
> > On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> > > From: Himanshu Madhani <himanshu.madhani@oracle.com>
> > >
> > > Add RWF_ATOMIC flag description for pwritev2().
> > >
> > > Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> > > #jpg: complete rewrite
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 45 insertions(+)
> ....
> > > +For when regular files are opened with
> > > +.BR open (2)
> > > +but without
> > > +.B O_SYNC
> > > +or
> > > +.B O_DSYNC
> > > +and the
> > > +.BR pwritev2()
> > > +call is made without
> > > +.B RWF_SYNC
> > > +or
> > > +.BR RWF_DSYNC
> > > +set, the range metadata must already be flushed to storage and the data range
> > > +must not be in unwritten state, shared, a preallocation, or a hole.
> >
> > I think that we can drop all of these flags requirements, since the
> > contiguous small space allocation requirement means that the fs can
> > provide all-or-nothing writes even if metadata updates are needed:
> >
> > If the file range is allocated and marked unwritten (i.e. a
> > preallocation), the ioend will clear the unwritten bit from the file
> > mapping atomically. After a crash, the application sees either zeroes
> > or all the data that was written.
> >
> > If the file range is shared, the ioend will map the COW staging extent
> > into the file atomically. After a crash, the application sees either
> > the old contents from the old blocks, or the new contents from the new
> > blocks.
> >
> > If the file range is a sparse hole, the directio setup will allocate
> > space and create an unwritten mapping before issuing the write bio. The
> > rest of the process works the same as preallocations and has the same
> > behaviors.
> >
> > If the file range is allocated and was previously written, the write is
> > issued and that's all that's needed from the fs. After a crash, reads
> > of the storage device produce the old contents or the new contents.
>
> This is exactly what I explained when reviewing the code that
> rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.

I'm glad we agree. :)

John, when you're back from vacation, can we get rid of this language
and all those checks under _is_dsync() in the iomap patch?

(That code is 100% the result of me handwaving and bellyaching 6 months
ago when the team was trying to get all the atomic writes bits working
prior to LSF and I was too burned out to think the xfs part through.
As a result, I decided that we'd only support strict overwrites for the
first iteration.)
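
For reference, the four range states discussed in the quoted text above,
and what an application may observe after a crash in each, can be sketched
as a small lookup. This is only an illustration of the reasoning in this
thread: the enum and function names are hypothetical and are not kernel or
xfs identifiers.

```c
/* Hypothetical names for illustration only; not kernel identifiers. */
enum range_state {
	RANGE_UNWRITTEN,	/* allocated and marked unwritten (a preallocation) */
	RANGE_SHARED,		/* shared blocks; ioend maps the COW staging extent */
	RANGE_HOLE,		/* sparse hole; directio setup allocates unwritten space */
	RANGE_WRITTEN,		/* allocated and previously written */
};

/* What a read of the range may return after a crash, per the cases above. */
const char *post_crash_contents(enum range_state state)
{
	switch (state) {
	case RANGE_UNWRITTEN:
	case RANGE_HOLE:
		/* The unwritten bit is cleared atomically at ioend. */
		return "zeroes or new data";
	case RANGE_SHARED:
	case RANGE_WRITTEN:
		/* Either the remap/overwrite happened or it did not. */
		return "old data or new data";
	}
	return "unknown";
}
```

In every case the outcome is all-or-nothing, which is why the strict
overwrite restriction does not look necessary.
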
> > Summarizing:
> >
> > An (ATOMIC|SYNC) request provides the strongest guarantees: data
> > will not be torn, and all file metadata updates are persisted before
> > the write is returned to userspace. Programs see either the old data or
> > the new data, even if there's a crash.
> >
> > (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> > updates for just that region are persisted before the write is returned.
> >
> > (ATOMIC) is the least strong -- data will not be torn. Neither the
> > filesystem nor the device makes guarantees that anything ended up on
> > stable storage, but if it does, programs see either the old data or the
> > new data.
>
> Yup, that makes sense to me.

Perhaps this ^^ is what we should be documenting here.
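
As a rough sketch of that summary, the three flag combinations could be
expressed like this. Assumptions: RWF_ATOMIC is the flag proposed in this
series, so the fallback value below is a placeholder for headers that do
not define it; the function name and the returned strings are made up for
illustration and are not part of any API.

```c
#define _GNU_SOURCE
#include <sys/uio.h>

/* Fallbacks for headers that do not expose these flags. */
#ifndef RWF_DSYNC
#define RWF_DSYNC	0x00000002
#endif
#ifndef RWF_SYNC
#define RWF_SYNC	0x00000004
#endif
#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* assumed value; flag proposed by this series */
#endif

/* Hypothetical helper: describe what a pwritev2() flags value promises. */
const char *atomic_write_guarantee(int flags)
{
	if (!(flags & RWF_ATOMIC))
		return "may be torn";
	if (flags & RWF_SYNC)
		return "untorn, all file metadata persisted";
	if (flags & RWF_DSYNC)
		return "untorn, updates for that region persisted";
	return "untorn, persistence not guaranteed";
}
```

A caller would pass the same flags value as the last argument of
pwritev2(fd, iov, iovcnt, offset, flags).
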
> > Maybe we should rename the whole UAPI s/atomic/untorn/...
>
> Perhaps, though "torn writes" is nomenclature that nobody outside
> storage and filesystem developers really knows about. All I ever
> hear from userspace developers is "we want atomic/all-or-nothing
> data writes"...

Fair 'enuf.

--D

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com