From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
Theodore Ts'o <tytso@mit.edu>,
John Garry <john.g.garry@oracle.com>,
linux-ext4@vger.kernel.org, Jan Kara <jack@suse.cz>,
Christoph Hellwig <hch@infradead.org>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
Date: Mon, 4 Nov 2024 16:09:56 -0800 [thread overview]
Message-ID: <20241105000956.GJ21832@frogsfrogsfrogs> (raw)
In-Reply-To: <Zygo6nqOJMoJxYrm@dread.disaster.area>
On Mon, Nov 04, 2024 at 12:52:42PM +1100, Dave Chinner wrote:
> On Thu, Oct 31, 2024 at 02:36:40PM -0700, Darrick J. Wong wrote:
> > On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> > > > This gets me to the third and much less general solution -- only allow
> > > > untorn writes if we know that the ioend only ever has to run a single
> > > > transaction. That's why untorn writes are limited to a single fsblock
> > > > for now -- it's a simple solution so that we can get our downstream
> > > > customers to kick the tires and start on the next iteration instead of
> > > > spending years on waterfalling.
> > > >
> > > > Did you notice that in all of these cases, the capabilities of the
> > > > filesystem's ioend processing determines the restrictions on the number
> > > > and type of mappings that ->iomap_begin can give to iomap?
> > > >
> > > > Now that we have a second system trying to hook up to the iomap support,
> > > > it's clear to me that the restrictions on mappings are specific to each
> > > > filesystem. Therefore, the iomap directio code should not impose
> > > > restrictions on the mappings it receives unless they would prevent the
> > > > creation of the single aligned bio.
> > > >
> > > > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > > > EINVAL or something if they look at the file mappings and discover that
> > > > they cannot perform the ioend without risking torn mapping updates. In
> > > > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > > > really belongs, but hold that thought.
> > > >
> > > > For the multi fsblock case, the ->iomap_begin functions would have to
> > > > check that only one metadata update would be necessary in the ioend.
> > > > That's where things get murky, since ext4/xfs drop their mapping locks
> > > > between calls to ->iomap_begin. So you'd have to check all the mappings
> > > > for unsupported mixed state every time. Yuck.
> > > >
> > >
> > > Thanks Darrick for taking time summarizing what all has been done
> > > and your thoughts here.
> > >
> > > > It might be less gross to retain the restriction that iomap accepts only
> > > > one mapping for the entire file range, like Ritesh has here.
> > >
> > > less gross :) sure.
> > >
> > > I would like to think of this as, being less restrictive (compared to
> > > only allowing a single fsblock) by adding a constraint on the atomic
> > > write I/O request i.e.
> > >
> > > "Atomic write I/O request to a region in a file is only allowed if that
> > > region has no partially allocated extents. Otherwise, the file system
> > > can fail the I/O operation by returning -EINVAL."
> > >
> > > Essentially by adding this constraint to the I/O request, we are
> > > helping the user to prevent atomic writes from accidentally getting
> > > torned and also allowing multi-fsblock writes. So I still think that
> > > might be the right thing to do here or at least a better start. FS can
> > > later work on adding such support where we don't even need above
> > > such constraint on a given atomic write I/O request.
> >
> > On today's ext4 call, Ted and Ritesh and I realized that there's a bit
> > more to it than this -- it's not possible to support untorn writes to a
> > mix of written/(cow,unwritten) mappings even if they all point to the
> > same physical space. If the system fails after the storage device
> > commits the write but before any of the ioend processing is scheduled, a
> > subsequent read of the previously written blocks will produce the new
> > data, but reads to the other areas will produce the old contents (or
> > zeroes, or whatever). That's a torn write.
>
> I'm *really* surprised that people are only realising that IO
> completion processing for atomic writes *must be atomic*.
I've been saying for a while; it's just that I didn't realize until last
week that there were more rules than "can't do an untorn write if you
need to do more than 1 mapping update":
Untorn writes are not possible if:
1. More than 1 mapping update is needed
2. 1 mapping update is needed but there's a written block
(1) can be worked around with a log intent item for ioend processing,
but I don't think (2) can at all. That's why I went back to saying that
untorn writes require that there can only be one mapping.
> This was the foundational constraint that the forced alignment
> proposal for XFS was intended to address. i.e. to prevent fs
> operations from violating atomic write IO constraints (e.g. punching
> sub-atomic write size holes in the file) so that the physical IO can
> be done without tearing and the IO completion processing that
> exposes the new data can be done atomically.
>
> > Therefore, iomap ought to stick to requiring that ->iomap_begin returns
> > a single iomap to cover the entire file range for the untorn write. For
> > an unwritten extent, the post-recovery read will see either zeroes or
> > the new contents; for a single-mapping COW it'll see old or new contents
> > but not both.
>
> I'm pretty sure we enforced that in the XFS mapping implemention for
> atomic writes using forced alignment. i.e. we have to return a
> correctly aligned, contiguous mapping to iomap or we have to return
> -EINVAL to indicate atomic write mapping failed.
I think you guys did, it's just the ext4 bigalloc thing that started
this back up again. :/
> Yes, we can check this in iomap, but it's really the filesystem that
> has to implement and enforce it...
I think we ought to keep the "1 mapping per untorn write" check in iomap
until someone decides that it's worth the trouble to make a filesystem
handle mixed states correctly. Mostly as a guard against the
implementations.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
next prev parent reply other threads:[~2024-11-05 0:09 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 3:45 [PATCH 0/6] ext4: Add atomic write support for DIO Ritesh Harjani (IBM)
2024-10-25 3:45 ` [PATCH 1/6] ext4: Add statx support for atomic writes Ritesh Harjani (IBM)
2024-10-25 9:41 ` John Garry
2024-10-25 10:08 ` Ritesh Harjani
2024-10-25 16:09 ` Darrick J. Wong
2024-10-25 17:45 ` Ritesh Harjani
2024-10-25 3:45 ` [PATCH 2/6] ext4: Check for atomic writes support in write iter Ritesh Harjani (IBM)
2024-10-25 9:44 ` John Garry
2024-10-25 10:33 ` Ritesh Harjani
2024-10-25 16:11 ` Darrick J. Wong
2024-10-25 17:50 ` Ritesh Harjani
2024-10-25 3:45 ` [PATCH 3/6] ext4: Support setting FMODE_CAN_ATOMIC_WRITE Ritesh Harjani (IBM)
2024-10-25 3:45 ` [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes Ritesh Harjani (IBM)
2024-10-25 16:16 ` Darrick J. Wong
2024-10-25 17:51 ` Ritesh Harjani
2024-10-27 22:26 ` Dave Chinner
2024-10-28 1:09 ` Ritesh Harjani
2024-10-28 5:26 ` Dave Chinner
2024-10-28 8:43 ` Ritesh Harjani
2024-10-28 18:14 ` Ritesh Harjani
2024-10-29 22:29 ` Dave Chinner
2024-10-29 23:51 ` Ritesh Harjani
2024-10-25 3:45 ` [PATCH 5/6] iomap: Lift blocksize restriction on " Ritesh Harjani (IBM)
2024-10-25 8:52 ` John Garry
2024-10-25 9:31 ` Ritesh Harjani
2024-10-25 9:59 ` John Garry
2024-10-25 10:35 ` Ritesh Harjani
2024-10-25 11:07 ` John Garry
2024-10-25 11:19 ` Ritesh Harjani
2024-10-25 12:23 ` John Garry
2024-10-25 12:36 ` Ritesh Harjani
2024-10-25 14:04 ` John Garry
2024-10-25 14:13 ` Ritesh Harjani
2024-10-25 18:28 ` Darrick J. Wong
2024-10-26 4:35 ` Ritesh Harjani
2024-10-31 21:36 ` Darrick J. Wong
2024-11-04 1:52 ` Dave Chinner
2024-11-05 0:09 ` Darrick J. Wong [this message]
2024-10-25 3:45 ` [PATCH 6/6] ext4: Add atomic write support for bigalloc Ritesh Harjani (IBM)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241105000956.GJ21832@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=john.g.garry@oracle.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox