From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
Jan Kara <jack@suse.cz>, "Darrick J . Wong" <djwong@kernel.org>,
Christoph Hellwig <hch@infradead.org>,
John Garry <john.g.garry@oracle.com>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes
Date: Wed, 30 Oct 2024 05:21:48 +0530
Message-ID: <87h68u79ij.fsf@gmail.com>
In-Reply-To: <ZyFh3uCGqB20+2X2@dread.disaster.area>

Hi Dave,
Dave Chinner <david@fromorbit.com> writes:
> On Mon, Oct 28, 2024 at 11:44:00PM +0530, Ritesh Harjani wrote:
>>
>> Hi Dave,
>>
>> Dave Chinner <david@fromorbit.com> writes:
>>
>> > On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
>> >>
>> >> Hi Dave,
>> >>
>> >> Dave Chinner <david@fromorbit.com> writes:
>> >>
>> >> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> >> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> >> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> >> >>
>> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> >> ---
>> >> >> fs/ext4/file.c | 10 +++++++++-
>> >> >> 1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> >> >> index f9516121a036..af6ebd0ac0d6 100644
>> >> >> --- a/fs/ext4/file.c
>> >> >> +++ b/fs/ext4/file.c
>> >> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >> >> iomap_ops = &ext4_iomap_overwrite_ops;
>> >> >> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> >> >> dio_flags, NULL, 0);
>> >> >> - if (ret == -ENOTBLK)
>> >> >> + if (ret == -ENOTBLK) {
>> >> >> ret = 0;
>> >> >> + /*
>> >> >> + * iomap will never return -ENOTBLK if write fails for atomic
>> >> >> + * write. But let's just add a safety net.
>> >> >> + */
>> >> >> + if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> >> >> + ret = -EIO;
>> >> >> + }
>> >> >
>> >> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
>> >> > That way we don't have to put this logic into every filesystem.
>> >>
>> >> This was originally intended as a safety net hence the WARN_ON_ONCE.
>> >> Later Darrick pointed out that we still might have an unconverted
>> >> condition in iomap which can return ENOTBLK for DIO atomic writes (page
>> >> cache invalidation).
>> >
>> > Yes. That's my point - iomap knows that it's an atomic write, it
>> > knows that invalidation failed, and it knows that there is no such
>> > thing as buffered atomic writes. So there is no possible fallback
>> > here, and it should be returning EIO in the page cache invalidation
>> > failure case and not ENOTBLK.
>> >
>>
>> So the iomap DIO can return following as return values which can make
>> some filesystems fallback to buffered-io (if they implement fallback
>> logic) -
>> (1) -ENOTBLK -> this is only returned for pagecache invalidation failure.
>> (2) 0 or partial write size -> This can never happen for atomic writes
>> (since we are only allowing for single fsblock as of now).
>
> Even when we allow multi-FSB atomic writes, the definition of
> atomic write is still "all or nothing". There is no scope for "short
> writes" when IOCB_ATOMIC is set - any condition that means we can't
> write the entire IO as a single bio, we need to abort and return
> EINVAL.
Yes. As long as it is a single bio, I agree that even the short write
condition should not be hit with the current iomap code.
>
> Hence -ENOTBLK should never be returned by iomap for atomic DIO
> writes - we need to say -EINVAL if the write could not be issued
> atomically for whatever reason it may be so the application knows
> that atomic IO submission was not possible for that IO.
>
Agreed, Dave. That is what iomap does today in the atomic write code
(except for one minor difference: it returns -EAGAIN rather than -EINVAL
when page cache invalidation fails, assuming the failure may be transient
and the request could be tried again).
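
To make the error-code distinction concrete, here is a minimal userspace
sketch. It is not the kernel's iomap code. Both function names are made up
for illustration, and it only models the behaviour described above: a failed
page cache invalidation yields -ENOTBLK for a normal DIO write, but -EAGAIN
when the write is atomic, so a caller that only falls back on -ENOTBLK never
falls back for an atomic write.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model, not the kernel's iomap code: pick the
 * error reported when page cache invalidation fails before a DIO write.
 */
int invalidate_failure_errno(bool atomic_write)
{
	/*
	 * There is no buffered atomic write to fall back to, so the
	 * atomic case reports a retryable error rather than -ENOTBLK.
	 */
	return atomic_write ? -EAGAIN : -ENOTBLK;
}

/* Hypothetical filesystem-side check: does this error ask for a fallback? */
bool wants_buffered_fallback(int ret, bool atomic_write)
{
	bool fallback = (ret == -ENOTBLK);

	/* The invariant argued for in this thread. */
	assert(!(fallback && atomic_write));
	return fallback;
}
```

The assert() plays roughly the role that WARN_ON_ONCE would play in the
filesystem: it should be unreachable as long as the lower layer keeps its
side of the contract.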
>> Now looking at XFS, it never falls back to buffered-io except in just 2
>> cases -
>> 1. When pagecache invalidation fails in iomap (can never happen for
>> atomic writes)
>
> Why can't this happen for atomic DIO writes? It's the same failure
> cases as for normal DIO writes, isn't it? (i.e. race with mmap
> writes)
>
I meant that, after John's patch which adds atomic write support to the
iomap code, iomap makes sure it does not return -ENOTBLK for an atomic
write request.
> My point is that if it's an atomic write, this failure should get
> turned into -EINVAL by the iomap code. We do not want a fallback to
> buffered IO when this situation happens for atomic IO.
>
>> 2. On unaligned DIO writes to reflinked CoW (not possible for atomic writes)
>
> This path doesn't ever go through iomap - XFS catches that case
> before it calls into iomap, so it's not relevant to how iomap
> behaves w.r.t atomic IO.
>
Right.
>> So it should anyway never happen that XFS falls back to buffered-io
>> for DIO atomic writes. Even today it does not fall back to buffered-io
>> for non-atomic short DIO writes.
>>
>> >> You rightly pointed out that it should be fixed in iomap. However, do you
>> >> think filesystems can still keep this as a safety net (maybe without the
>> >> WARN_ON_ONCE)?
>> >
>> > I don't see any point in adding "impossible to hit" checks into
>> > filesystems just in case some core infrastructure has a bug
>> > introduced....
>>
>> Yes, that is true for XFS. EXT4 however can return -ENOTBLK for short
>> writes, though it should not happen for current atomic write case where
>> we are only allowing for 1 fsblock.
>
> Yes, but the -ENOTBLK error returned from ext4_iomap_end() if
> nothing was written does not get returned to ext4 from
> __iomap_dio_rw(). It is consumed by the iomap code:
>
> /* magic error code to fall back to buffered I/O */
> if (ret == -ENOTBLK) {
> wait_for_completion = true;
> ret = 0;
> }
>
> This means that all the IO that was issued gets completed before
> returning to the caller and that's how the short write comes about.
>
> -ENOTBLK is *not returned to the caller* on a short write -
Yes. That is also my understanding of how iomap handles the short write
case.
> iomap_dio_rw will return 0 (success). The caller then has to look
> at the iov_iter state to determine if the write was fully completed.
> This is exactly what the ext4 code currently does for all DIO
> writes, not just those that return -ENOTBLK.
>
yes. Agreed.
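
As a rough illustration of that point, the sketch below models a caller that
decides what to do next from the residual byte count rather than from the
return code. It is a userspace toy under stated assumptions:
classify_dio_result() and the enum are hypothetical names, and "remaining"
stands in for the residual count the kernel tracks in the iov_iter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes for this sketch. */
enum next_step { WRITE_DONE, BUFFERED_FALLBACK, ATOMIC_SHORT_WRITE };

/*
 * Decide the next step after a DIO attempt that reported success.
 * requested: bytes the caller asked to write
 * remaining: bytes still unwritten (models the iov_iter residual)
 * atomic:    models IOCB_ATOMIC being set on the request
 */
enum next_step classify_dio_result(size_t requested, size_t remaining,
				   bool atomic)
{
	assert(remaining <= requested);

	if (remaining == 0)
		return WRITE_DONE;

	/* A short write: only the residual count reveals it. */
	return atomic ? ATOMIC_SHORT_WRITE : BUFFERED_FALLBACK;
}
```

Note that no error code is consulted here, which is why a check keyed only
on -ENOTBLK cannot catch a short atomic write.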
>> I would still like to go with a WARN_ON_ONCE where we are calling ext4
>> buffered-io handling for DIO fallback writes. This is to catch any bugs
>> even in future when we move to multi-fsblock case (until we have atomic
>> write support for buffered-io).
>
> Your choice, but please realise that it is not going to catch short
> atomic writes at all.
>
Thanks Dave. Yes, I would still like to keep a WARN_ON_ONCE, since ext4
has fallback handling logic where a short DIO write or the -ENOTBLK case
could later be handled by the buffered-io logic (though I agree iomap
won't let that happen for the atomic write case).

A WARN_ON_ONCE just before the buffered-io fallback handling logic in the
ext4 DIO path would be my preferred choice, only to make sure we catch
any unwanted bugs in the future too.

So I was thinking of this change instead -

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8116bd78910b..61787a37e9d4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -599,6 +599,13 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ssize_t err;
 		loff_t endbyte;

+		/*
+		 * There is no support for atomic writes on buffered-io yet,
+		 * we should never fallback to buffered-io for DIO atomic
+		 * writes.
+		 */
+		WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC);
+
 		offset = iocb->ki_pos;
 		err = ext4_buffered_write_iter(iocb, from);
 		if (err < 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fcdee27b9aa2..26b3c84d7f64 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3449,12 +3449,16 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
 	/*
 	 * Check to see whether an error occurred while writing out the data to
-	 * the allocated blocks. If so, return the magic error code so that we
-	 * fallback to buffered I/O and attempt to complete the remainder of
-	 * the I/O. Any blocks that may have been allocated in preparation for
-	 * the direct I/O will be reused during buffered I/O.
+	 * the allocated blocks. If so, return the magic error code for
+	 * non-atomic write so that we fallback to buffered I/O and attempt to
+	 * complete the remainder of the I/O.
+	 * For atomic writes we will simply fail the I/O request if we couldn't
+	 * write anything. For non-atomic writes, any blocks that may have been
+	 * allocated in preparation for the direct I/O will be reused during
+	 * buffered I/O.
 	 */
-	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+	if (!(flags & IOMAP_ATOMIC) && (flags & (IOMAP_WRITE | IOMAP_DIRECT))
+	    && written == 0)
 		return -ENOTBLK;

 	return 0;
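
For anyone who wants to check the new condition outside the kernel, here is
a small standalone sketch of it. The SK_* flag values and the function name
are assumptions made up for this example (the kernel defines its own
IOMAP_* values), and only the boolean logic of the hunk above is mirrored.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical flag values for this sketch; the kernel defines its own. */
#define SK_IOMAP_WRITE  (1u << 0)
#define SK_IOMAP_DIRECT (1u << 1)
#define SK_IOMAP_ATOMIC (1u << 2)

/*
 * Userspace model of the proposed check: request a buffered fallback
 * (-ENOTBLK) only for non-atomic writes that wrote nothing.
 */
int sketch_iomap_end(unsigned int flags, ssize_t written)
{
	/*
	 * As in the hunk above, the mask test is true when either the
	 * WRITE or the DIRECT bit is set.
	 */
	if (!(flags & SK_IOMAP_ATOMIC) &&
	    (flags & (SK_IOMAP_WRITE | SK_IOMAP_DIRECT)) && written == 0)
		return -ENOTBLK;

	return 0;
}
```

With these assumed values, a zero-byte non-atomic direct write maps to
-ENOTBLK, while the same call with the atomic bit set returns 0.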
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Thanks a lot for the review!
-ritesh