From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
Theodore Ts'o <tytso@mit.edu>,
John Garry <john.g.garry@oracle.com>,
linux-ext4@vger.kernel.org, Jan Kara <jack@suse.cz>,
Christoph Hellwig <hch@infradead.org>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
Date: Mon, 4 Nov 2024 12:52:42 +1100 [thread overview]
Message-ID: <Zygo6nqOJMoJxYrm@dread.disaster.area> (raw)
In-Reply-To: <20241031213640.GB21832@frogsfrogsfrogs>
On Thu, Oct 31, 2024 at 02:36:40PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> > > This gets me to the third and much less general solution -- only allow
> > > untorn writes if we know that the ioend only ever has to run a single
> > > transaction. That's why untorn writes are limited to a single fsblock
> > > for now -- it's a simple solution so that we can get our downstream
> > > customers to kick the tires and start on the next iteration instead of
> > > spending years on waterfalling.
> > >
> > > Did you notice that in all of these cases, the capabilities of the
> > > filesystem's ioend processing determines the restrictions on the number
> > > and type of mappings that ->iomap_begin can give to iomap?
> > >
> > > Now that we have a second system trying to hook up to the iomap support,
> > > it's clear to me that the restrictions on mappings are specific to each
> > > filesystem. Therefore, the iomap directio code should not impose
> > > restrictions on the mappings it receives unless they would prevent the
> > > creation of the single aligned bio.
> > >
> > > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > > EINVAL or something if they look at the file mappings and discover that
> > > they cannot perform the ioend without risking torn mapping updates. In
> > > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > > really belongs, but hold that thought.
> > >
> > > For the multi fsblock case, the ->iomap_begin functions would have to
> > > check that only one metadata update would be necessary in the ioend.
> > > That's where things get murky, since ext4/xfs drop their mapping locks
> > > between calls to ->iomap_begin. So you'd have to check all the mappings
> > > for unsupported mixed state every time. Yuck.
> > >
> >
> > Thanks Darrick for taking time summarizing what all has been done
> > and your thoughts here.
> >
> > > It might be less gross to retain the restriction that iomap accepts only
> > > one mapping for the entire file range, like Ritesh has here.
> >
> > less gross :) sure.
> >
> > I would like to think of this as, being less restrictive (compared to
> > only allowing a single fsblock) by adding a constraint on the atomic
> > write I/O request i.e.
> >
> > "Atomic write I/O request to a region in a file is only allowed if that
> > region has no partially allocated extents. Otherwise, the file system
> > can fail the I/O operation by returning -EINVAL."
> >
> > Essentially by adding this constraint to the I/O request, we are
> > helping the user to prevent atomic writes from accidentally getting
> > torned and also allowing multi-fsblock writes. So I still think that
> > might be the right thing to do here or at least a better start. FS can
> > later work on adding such support where we don't even need above
> > such constraint on a given atomic write I/O request.
>
> On today's ext4 call, Ted and Ritesh and I realized that there's a bit
> more to it than this -- it's not possible to support untorn writes to a
> mix of written/(cow,unwritten) mappings even if they all point to the
> same physical space. If the system fails after the storage device
> commits the write but before any of the ioend processing is scheduled, a
> subsequent read of the previously written blocks will produce the new
> data, but reads to the other areas will produce the old contents (or
> zeroes, or whatever). That's a torn write.
I'm *really* surprised that people are only realising that IO
completion processing for atomic writes *must be atomic*.
This was the foundational constraint that the forced alignment
proposal for XFS was intended to address. i.e. to prevent fs
operations from violating atomic write IO constraints (e.g. punching
sub-atomic write size holes in the file) so that the physical IO can
be done without tearing and the IO completion processing that
exposes the new data can be done atomically.
> Therefore, iomap ought to stick to requiring that ->iomap_begin returns
> a single iomap to cover the entire file range for the untorn write. For
> an unwritten extent, the post-recovery read will see either zeroes or
> the new contents; for a single-mapping COW it'll see old or new contents
> but not both.
I'm pretty sure we enforced that in the XFS mapping implemention for
atomic writes using forced alignment. i.e. we have to return a
correctly aligned, contiguous mapping to iomap or we have to return
-EINVAL to indicate atomic write mapping failed.
Yes, we can check this in iomap, but it's really the filesystem that
has to implement and enforce it...
-Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2024-11-04 1:52 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
2024-10-25 3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
2024-10-25 9:41 ` John Garry
2024-10-25 10:08 ` Ritesh Harjani
2024-10-25 16:09 ` Darrick J. Wong
2024-10-25 17:45 ` Ritesh Harjani
2024-10-25 3:45 ` [PATCH 2/6] ext4: Check for atomic writes support in write iter Ritesh Harjani (IBM)
2024-10-25 9:44 ` John Garry
2024-10-25 10:33 ` Ritesh Harjani
2024-10-25 16:11 ` Darrick J. Wong
2024-10-25 17:50 ` Ritesh Harjani
2024-10-25 3:45 ` [PATCH 3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE Ritesh Harjani (IBM)
2024-10-25 3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
2024-10-25 16:16 ` Darrick J. Wong
2024-10-25 17:51 ` Ritesh Harjani
2024-10-27 22:26 ` Dave Chinner
2024-10-28 1:09 ` Ritesh Harjani
2024-10-28 5:26 ` Dave Chinner
2024-10-28 8:43 ` Ritesh Harjani
2024-10-28 18:14 ` Ritesh Harjani
2024-10-29 22:29 ` Dave Chinner
2024-10-29 23:51 ` Ritesh Harjani
2024-10-25 3:45 ` [PATCH 5/6] iomap: Lift blocksize restriction on " Ritesh Harjani (IBM)
2024-10-25 8:52 ` John Garry
2024-10-25 9:31 ` Ritesh Harjani
2024-10-25 9:59 ` John Garry
2024-10-25 10:35 ` Ritesh Harjani
2024-10-25 11:07 ` John Garry
2024-10-25 11:19 ` Ritesh Harjani
2024-10-25 12:23 ` John Garry
2024-10-25 12:36 ` Ritesh Harjani
2024-10-25 14:04 ` John Garry
2024-10-25 14:13 ` Ritesh Harjani
2024-10-25 18:28 ` Darrick J. Wong
2024-10-26 4:35 ` Ritesh Harjani
2024-10-31 21:36 ` Darrick J. Wong
2024-11-04 1:52 ` Dave Chinner [this message]
2024-11-05 0:09 ` Darrick J. Wong
2024-10-25 3:45 ` [PATCH 6/6] ext4: Add atomic write support for bigalloc Ritesh Harjani (IBM)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zygo6nqOJMoJxYrm@dread.disaster.area \
--to=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=john.g.garry@oracle.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox