From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Keith Busch <kbusch@kernel.org>, Jan Kara <jack@suse.cz>,
Keith Busch <kbusch@meta.com>,
linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org,
snitzer@kernel.org, axboe@kernel.dk, dw@davidwei.uk,
brauner@kernel.org, hch@lst.de, martin.petersen@oracle.com,
djwong@kernel.org, linux-xfs@vger.kernel.org,
viro@zeniv.linux.org.uk, Jan Kara <jack@suse.com>,
Brian Foster <bfoster@redhat.com>
Subject: Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
Date: Fri, 29 Aug 2025 08:49:02 +0530
Message-ID: <878qj2g6hl.fsf@gmail.com>
In-Reply-To: <87bjnyg9me.fsf@gmail.com>
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> Jan Kara <jack@suse.cz> writes:
>
>> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
>>> Keith Busch <kbusch@kernel.org> writes:
>>>
>>> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
>>> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
>>> >> > Keith Busch <kbusch@meta.com> writes:
>>> >> > >
>>> >> > > - EXT4 falls back to buffered io for writes but not for reads.
>>> >> >
>>> >> > ++linux-ext4 to get any historical context behind why the difference of
>>> >> > behaviour in reads v/s writes for EXT4 DIO.
>>> >>
>>> >> Hum, how did you test? Because in the basic testing I did (with vanilla
>>> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
>>> >> falling back to buffered IO only if the underlying file itself does not
>>> >> support any kind of direct IO.
>>> >
>>> > Simple test case (dio-offset-test.c) below.
>>> >
>>> > I also ran this on vanilla kernel and got these results:
>>> >
>>> > # mkfs.ext4 /dev/vda
>>> > # mount /dev/vda /mnt/ext4/
>>> > # make dio-offset-test
>>> > # ./dio-offset-test /mnt/ext4/foobar
>>> > write: Success
>>> > read: Invalid argument
>>> >
>>> > I tracked the "write: Success" down to ext4's handling for the "special"
>>> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
>>> >
>>>
>>> Right. Ext4 has fallback only for dio writes but not for DIO reads...
>>>
>>> buffered
>>> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>>> {
>>> 	/* must be a directio to fall back to buffered */
>>> 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>>> 	    (IOMAP_WRITE | IOMAP_DIRECT))
>>> 		return false;
>>>
>>> 	...
>>> }
>>>
>>> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>>> -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
>>>
>>>
>>> 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>> 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>> 		return -EINVAL;
>>>
>>> EXT4 then falls back to buffered-io only for writes, but not for reads.
>>
>> Right. And the fallback for writes was actually inadvertently "added" by
>> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
>> changed the error handling logic. Previously if iomap_dio_bio_iter()
>> returned EINVAL, it got propagated to userspace regardless of what
>> ->iomap_end() returned. After this commit if ->iomap_end() returns error
>> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
>> the error returned by iomap_dio_bio_iter().
>>
>> Now both the old and new behavior make some sense so I won't argue that the
>> new iomap_iter() behavior is wrong. But I think we should change ext4 back
>> to the old behavior of failing unaligned dio writes instead of them falling
>> back to buffered IO. I think something like the attached patch should do
>> the trick - it makes unaligned dio writes fail again while writes to holes
>> of indirect-block mapped files still correctly fall back to buffered IO.
>> Once fstests run completes, I'll do a proper submission...
>>
>
> Aah, right. So it wasn't EXT4 which had this behaviour of falling back
> to buffered I/O for unaligned writes. Earlier EXT4 was assuming an error
> code will be detected by iomap and will be passed to it as "written" in
> ext4_iomap_end() for such unaligned writes. But I guess that logic
> silently got changed with that commit. Thanks for analyzing that.
> I missed looking underneath iomap behaviour change :).
>
>
>>
>> Honza
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
>> From: Jan Kara <jack@suse.cz>
>> Date: Wed, 27 Aug 2025 14:55:19 +0200
>> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>>
>> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
>> changed the error handling logic in iomap_iter(). Previously any error
>> from iomap_dio_bio_iter() got propagated to userspace, after this commit
>> if ->iomap_end returns error, it gets propagated to userspace instead of
>> an error from iomap_dio_bio_iter(). This results in unaligned writes to
>> ext4 to silently fallback to buffered IO instead of erroring out.
>>
>> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
>> unnecessary these days. It is enough to return ENOTBLK from
>> ext4_iomap_begin() when we don't support DIO write for that particular
>> file offset (due to hole).
>
> Right. This mainly only happens if we have holes in non-extent (indirect
> blocks) case.
>
Thinking more about this case: do we really want a fallback to buffered I/O
for unaligned writes in the indirect-block case?

I don't think we care much here, right? And anyway, unaligned writes
should behave the same for the extent and non-extent cases, right?

I guess the problem is that the iomap alignment check happens in
iomap_dio_bio_iter(), where it has a valid bdev (populated by the filesystem
during the ->iomap_begin() call) to check the alignment against. But in the
indirect-block case we return -ENOTBLK much earlier, from the ->iomap_begin()
call itself.
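
To make the ordering concrete, here is a rough userspace sketch. It is not
kernel code: struct model_file and all the model_*() helpers are made-up
names for illustration, and the only thing it tries to capture is that the
mapping step runs (and can fail) before the alignment check gets a block
size to check against.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a file; not a kernel structure. */
struct model_file {
	bool indirect_mapped;	/* non-extent (indirect block) file */
	bool range_is_hole;	/* the write targets an unallocated range */
	unsigned int lbs;	/* logical block size of the backing device */
};

/*
 * Stand-in for ->iomap_begin(): a DIO write to a hole in an
 * indirect-mapped file is refused before any device is reported.
 */
static int model_iomap_begin(const struct model_file *f, unsigned int *lbs)
{
	if (f->indirect_mapped && f->range_is_hole)
		return -ENOTBLK;
	*lbs = f->lbs;
	return 0;
}

/* Stand-in for the alignment check in iomap_dio_bio_iter(). */
static int model_dio_bio_iter(unsigned long long pos,
			      unsigned long long len, unsigned int lbs)
{
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);	/* power of two */
	if ((pos | len) & (lbs - 1))
		return -EINVAL;
	return 0;
}

/* Mapping first, alignment second. */
int model_dio_write(const struct model_file *f,
		    unsigned long long pos, unsigned long long len)
{
	unsigned int lbs = 0;
	int ret = model_iomap_begin(f, &lbs);

	if (ret)
		return ret;	/* alignment is never looked at */
	return model_dio_bio_iter(pos, len, lbs);
}
```

In this model an unaligned write (say pos=1, len=511 with a 512-byte block
size) gets -EINVAL on an extent-mapped file, but -ENOTBLK (i.e. buffered
fallback) when it hits a hole in an indirect-mapped file, which is the
inconsistency I am wondering about.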
-ritesh
> Also, as I see ext4 always just falls back to buffered-io for no or
> partial writes (unless iomap returned any error code). So, I was just
> wondering if that could ever happen for DIO atomic write case. It's good
> that we have a WARN_ON_ONCE() check in there to catch it. But I was
> wondering if this needs an explicit handling in ext4_dio_write_iter() to
> not fallback to buffered-writes for atomic DIO requests?
>
> -ritesh
>
>
>
>>
>> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>> fs/ext4/file.c | 2 --
>> fs/ext4/inode.c | 35 -----------------------------------
>> 2 files changed, 37 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 93240e35ee36..cf39f57d21e9 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> iomap_ops = &ext4_iomap_overwrite_ops;
>> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> dio_flags, NULL, 0);
>> - if (ret == -ENOTBLK)
>> - ret = 0;
>> if (extend) {
>> /*
>> * We always perform extending DIO write synchronously so by
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 5b7a15db4953..c3b23c90fd11 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
>> return ret;
>> }
>>
>> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>> -{
>> - /* must be a directio to fall back to buffered */
>> - if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>> - (IOMAP_WRITE | IOMAP_DIRECT))
>> - return false;
>> -
>> - /* atomic writes are all-or-nothing */
>> - if (flags & IOMAP_ATOMIC)
>> - return false;
>> -
>> - /* can only try again if we wrote nothing */
>> - return written == 0;
>> -}
>> -
>> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>> - ssize_t written, unsigned flags, struct iomap *iomap)
>> -{
>> - /*
>> - * Check to see whether an error occurred while writing out the data to
>> - * the allocated blocks. If so, return the magic error code for
>> - * non-atomic write so that we fallback to buffered I/O and attempt to
>> - * complete the remainder of the I/O.
>> - * For non-atomic writes, any blocks that may have been
>> - * allocated in preparation for the direct I/O will be reused during
>> - * buffered I/O. For atomic write, we never fallback to buffered-io.
>> - */
>> - if (ext4_want_directio_fallback(flags, written))
>> - return -ENOTBLK;
>> -
>> - return 0;
>> -}
>> -
>> const struct iomap_ops ext4_iomap_ops = {
>> .iomap_begin = ext4_iomap_begin,
>> - .iomap_end = ext4_iomap_end,
>> };
>>
>> const struct iomap_ops ext4_iomap_overwrite_ops = {
>> .iomap_begin = ext4_iomap_overwrite_begin,
>> - .iomap_end = ext4_iomap_end,
>> };
>>
>> static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
>> --
>> 2.43.0